Ideas and Roadmap
- My biggest wish is that a community of interested people forms around the tool and helps determine the priorities for further development, e.g. through simple votes on Discord or by upvoting feature requests on GitHub.
- I would like to make the tool as easy to use as possible, so that it can be used by people who are not programmers. At the same time, I want even the most demanding users to find everything important in it.
- I also care a lot about the usability of the offline website export. Unfortunately, it turned out that it is practically impossible to run modern websites that use JS modules in offline mode over the `file://` protocol, because browsers' default CORS settings do not allow loading JS modules from `file://`. I didn't have much success even with attempts to inline the JS modules directly into the HTML; that brought its own set of problems. I will try to solve this problem in the future, but I am not sure whether it will be possible. If you have any ideas, please let me know. For modern websites built on JS frameworks with SSR and hydration, it is therefore better to generate the offline export with `--disable-javascript`, or at least `--remove-anchor-listeners`.
- Already during the design of the offline export, I had an idea for a very useful feature: to create a version of the export, together with a corresponding Nginx/Apache/Varnish/Traefik vhost configuration, that could work as a legitimate static mirror of the website with matching URLs, of course without the real backend functionality (forms, etc.). For websites using GraphQL, for example, the Nginx vhost could also contain Lua scripts to handle those POST requests. For some of our projects, it would be great if, through GeoDNS services with auto-failover mechanisms, we could direct visitors to a completely static version of the website on our CDN in the event of an extreme DDoS attack or an unexpected outage. I imagine this more robust feature would find use among other users around the world as well.
- An important feature that is still missing is a measurable quality score for each analyzed area, from 0 to 100, with configurable thresholds; based on these, the crawler would terminate with an exit code reporting whether the website passed the tests or not. This will be useful in CI/CD pipelines (a rough sketch of the idea is at the end of this section).
- Another relatively simple and useful improvement will be the release of a public Docker image with which it will be possible to start crawling with a single `docker run` command.
- Most of the problems found relate to specific pages or even specific elements on a page (too-bulky inline SVGs, incorrect heading structure, essential elements without ARIA attributes, etc.). I want to create a special variant of the offline website export, a kind of "live view of errors", where it will be possible to browse the entire website in its offline version, but on the affected pages the problematic elements will be highlighted in color (or flashing) with a tooltip detailing the error. Other errors (e.g. headers) would be displayed in a semi-transparent block somewhere in the header of the page.
- To make the individual findings and tables easier to interpret, I want to add a hint to each table/section as to why that area is important and how to make improvements.
- Many tables in the HTML report contain aggregated statistical data. I want to add a light form of charts (pies, lines, bars) to these tables for faster orientation.
- Currently, the Crawler does not interpret JavaScript, so pages built exclusively as an SPA (Single-Page Application) without functional SSR (Server-Side Rendering) may not be processed correctly. In the future, we are considering giving advanced users the option to interpret JavaScript through Chromium/Puppeteer running in Docker.
- So far, only a few of the important checks have been implemented in each of the areas. In the coming weeks, we will implement more, or consolidate and improve existing ones.
- At the beginning of development, I told myself that I didn't want any external dependencies in the code, or the need to run a dependency installation step. That's why I decided not to use any framework. With further development, I would reconsider this, since most users will use the pre-built package or the UI installer for their platform, and advanced users will be able to install the dependencies without problems. A templating language would then make it easier to compose the HTML report, and it would also be easier to involve other developers.
- We love Swoole, which the Crawler also uses, and we have been using it for many years with unrivaled performance and perfect stability in many of our microservices. Swoole works great on Linux and macOS, but it uses Cygwin for its Windows-compatible runtime, and this combination of technologies is not very stable. Going forward, I will consider switching to ReactPHP or, in the longer term, Rust. But I will also try to resolve this instability with the Swoole developers.
- For data synchronization between Swoole threads, Swoole Tables are used. Their disadvantage is the need to pre-allocate memory for a specific number of rows (see the sketch below). This is why the `--max-queue-length`, `--max-visited-urls` and `--max-url-length` parameters exist. In the case of extremely large websites with hundreds of thousands or millions of URLs, these parameters will need to be increased, which will also require a lot of memory. For this reason, we will probably switch to SQLite or a similar database in the future. That makes sense for other reasons as well: for example, it will make it possible to resume crawling after an unexpected interruption. Even now this is possible to some extent thanks to the HTTP cache.
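To illustrate why these limits have to be known up front, here is a minimal sketch of how a Swoole Table is pre-allocated. The column names and sizes are illustrative, not the Crawler's real data structures; only the general `Swoole\Table` API usage is shown as-is.

```php
<?php
// Minimal sketch (illustrative columns, not the Crawler's actual schema):
// a Swoole\Table must know its maximum number of rows and the size of every
// string column before create() is called, because all shared memory is
// allocated up front. This is what limits such as --max-visited-urls and
// --max-url-length translate into.

use Swoole\Table;

$maxVisitedUrls = 10_000; // e.g. the value of --max-visited-urls
$maxUrlLength   = 2048;   // e.g. the value of --max-url-length

$visited = new Table($maxVisitedUrls);
$visited->column('url', Table::TYPE_STRING, $maxUrlLength);
$visited->column('statusCode', Table::TYPE_INT);
$visited->column('sizeBytes', Table::TYPE_INT);
$visited->create(); // shared memory for all rows is allocated here

// The table can then be shared safely between Swoole workers/coroutines:
$visited->set(md5('https://example.com/'), [
    'url'        => 'https://example.com/',
    'statusCode' => 200,
    'sizeBytes'  => 12345,
]);
```

Once `create()` has been called, the table cannot grow, which is why these limits have to be set generously (and enough memory reserved) for very large websites.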
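The 0-100 quality scoring mentioned earlier in this section is not implemented yet, so the following is only a rough sketch of the intended contract: per-area scores are compared against user-defined thresholds, and the result is turned into an exit code that a CI/CD job can act on. All area names, scores and thresholds are hypothetical placeholders.

```php
<?php
// Illustrative sketch only: the 0-100 scoring feature does not exist yet.
// Area names, scores and thresholds are hypothetical placeholders.

$scores = [
    'accessibility'  => 92,
    'seo'            => 81,
    'best_practices' => 67,
];

$thresholds = [
    'accessibility'  => 90,
    'seo'            => 80,
    'best_practices' => 75,
];

$failures = [];
foreach ($thresholds as $area => $minScore) {
    $score = $scores[$area] ?? 0;
    if ($score < $minScore) {
        $failures[] = sprintf('%s: %d < required %d', $area, $score, $minScore);
    }
}

if ($failures !== []) {
    fwrite(STDERR, "Quality thresholds not met:\n  " . implode("\n  ", $failures) . "\n");
    exit(1); // a non-zero exit code fails the CI/CD job
}

echo "All quality thresholds met.\n";
exit(0);
```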