Website to Markdown Converter
The SiteOne Crawler can export or convert an entire website with all subpages to browsable markdown. This is particularly useful for feeding website content (like documentation) into AI tools that often handle markdown more effectively than raw HTML.
Features
- Exports the entire website with all subpages to browsable markdown.
- Optionally includes images and other files (PDF, etc.).
- Allows removing unwanted elements from the exported markdown using CSS selectors.
- Can move content before the main H1 heading to the end of the markdown.
- Implements code block detection and syntax highlighting.
- Converts HTML tables to markdown tables.
- Can combine all exported markdown files into a single large markdown file.
- Includes smart removal of duplicate website headers and footers in the combined single markdown file.
Command-line Options
| Parameter | Description | Default |
|---|---|---|
| --markdown-export-dir | Path to the directory where the markdown version of the website is saved. The directory is created if it doesn’t exist. | |
| --markdown-export-single-file | Path to a file where all exported markdown files are combined into one document. Requires --markdown-export-dir to be set. Ideal for AI tools that need to process the entire website content in one go. | |
| --markdown-move-content-before-h1-to-end | Move all content before the main H1 heading (typically the header with the menu) to the end of the markdown. | |
| --markdown-disable-images | Do not export or show images in markdown files. Images are enabled by default. | |
| --markdown-disable-files | Do not export or link files other than HTML/CSS/JS/fonts/images, e.g. PDF, ZIP, etc. These files are enabled by default. | |
| --markdown-remove-links-and-images-from-single-file | Remove links and images from the combined single markdown file. Useful for AI tools that don’t need these elements. Requires --markdown-export-single-file to be set. | |
| --markdown-exclude-selector | Exclude page content (DOM elements) from the markdown export using CSS selectors like ‘header’, ‘.header’, ‘#header’, etc. Can be specified multiple times. | |
| --markdown-replace-content | Replace text content using the simple format ‘foo -> bar’ or a regexp in PCRE format, e.g. /card[0-9]/i -> card. | |
| --markdown-replace-query-string | By default, a query string is replaced by a short hash in the filename. With this option, selected characters are replaced instead. You can use the simple format ‘foo -> bar’ or a regexp in PCRE format, e.g. /([a-z]+)=([^&]*)(&\|$)/i -> $1__$2. | |
| --markdown-export-store-only-url-regex | For debugging: when set, it activates debug mode and stores only URLs that match at least one of these PCRE regexes. Can be specified multiple times. | |
| --markdown-ignore-store-file-error | Ignore any file storing errors. The export process will continue. | |
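As a rough sketch of how several of these options might be combined, the wrapper below crawls a site, writes the per-page export, and also produces a combined single file. The --url option is assumed to be the crawler's start-URL parameter (it is not described on this page), and the CRAWLER variable, the export_site_markdown function name, the .cookie-banner selector and the example paths are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: CRAWLER, export_site_markdown and the selectors are hypothetical.
# --url is ASSUMED to be the crawler's start-URL option; verify it for your version.
CRAWLER="${CRAWLER:-./siteone-crawler}"

# Arguments: <start URL> <export directory> <combined single file>
export_site_markdown() {
  local url="$1" out_dir="$2" single_file="$3"
  "$CRAWLER" \
    --url="$url" \
    --markdown-export-dir="$out_dir" \
    --markdown-export-single-file="$single_file" \
    --markdown-move-content-before-h1-to-end \
    --markdown-exclude-selector='header' \
    --markdown-exclude-selector='.cookie-banner' \
    --markdown-replace-content='/card[0-9]/i -> card'
}

# Usage (this starts a real crawl):
#   export_site_markdown "https://mydomain.tld/" "./tmp/mydomain.tld.md" "./tmp/mydomain.tld.single.md"
```

Passing --markdown-exclude-selector twice relies on the "can be specified multiple times" note in the table. Setting CRAWLER=echo before calling the function prints the assembled arguments instead of running a crawl, which is a cheap way to review the command first.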
Standalone HTML-to-Markdown conversion
You don’t always need to crawl a whole website. The crawler can convert a single local HTML file to Markdown using the very same conversion pipeline as --markdown-export-dir:

    # Convert and print Markdown to stdout (pipe-friendly)
    ./siteone-crawler --html-to-markdown=page.html

    # Convert and write Markdown to a file
    ./siteone-crawler --html-to-markdown=page.html --html-to-markdown-output=page.md

    # Combine with markdown options
    ./siteone-crawler --html-to-markdown=page.html \
      --markdown-disable-images \
      --markdown-exclude-selector=nav \
      --markdown-move-content-before-h1-to-end

- No crawling is performed - just one file in, Markdown out.
- By default the converted Markdown is printed to stdout; use --html-to-markdown-output to write it to a file instead.
- It respects --markdown-disable-images, --markdown-disable-files, --markdown-move-content-before-h1-to-end and --markdown-exclude-selector.
- Because there is no website context, it does not rewrite links - URLs are kept as they appear in the source HTML.
See --html-to-markdown and --html-to-markdown-output.
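Since the standalone mode handles one file per invocation, converting a folder of saved pages needs a loop around it. The following is a minimal sketch of that idea, assuming the options behave as described above; the CRAWLER variable and the convert_html_dir function are hypothetical names, not part of the tool.

```shell
#!/usr/bin/env bash
# Sketch only: CRAWLER and convert_html_dir are hypothetical names.
CRAWLER="${CRAWLER:-./siteone-crawler}"

# Convert every *.html file in a directory to a sibling .md file.
convert_html_dir() {
  local src_dir="$1" html md
  for html in "$src_dir"/*.html; do
    [ -e "$html" ] || continue   # no match: the glob stays literal, so skip it
    md="${html%.html}.md"        # page.html -> page.md
    "$CRAWLER" \
      --html-to-markdown="$html" \
      --html-to-markdown-output="$md" \
      --markdown-exclude-selector=nav || return 1   # stop at the first failure
  done
}

# Usage:
#   convert_html_dir ./saved-pages
```

Each file is converted independently, so links between the pages are left exactly as they appear in the source HTML.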
Browse the Markdown export in your browser
Markdown is great for AI and Git, but sometimes you just want to read it. The built-in HTTP server renders your Markdown export directory as styled HTML:

    ./siteone-crawler --serve-markdown=./tmp/mydomain.tld.md

The viewer renders .md files as nicely styled HTML pages with tables, collapsible accordions, dark/light mode and breadcrumb navigation. No crawling is performed. By default it listens on port 8321 (--serve-port) and binds to 127.0.0.1 (--serve-bind-address, localhost only) - use 0.0.0.0 to expose it on your network. See --serve-markdown.
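A small helper can make the port and bind address explicit while keeping the documented defaults. This is only a sketch: the --option=value spelling for --serve-port and --serve-bind-address is assumed by analogy with the other options on this page, and CRAWLER and serve_markdown are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: CRAWLER and serve_markdown are hypothetical names.
# The "=value" form of --serve-port / --serve-bind-address is an ASSUMPTION.
CRAWLER="${CRAWLER:-./siteone-crawler}"

# Arguments: <export directory> [port, default 8321] [bind address, default 127.0.0.1]
serve_markdown() {
  local dir="$1" port="${2:-8321}" bind="${3:-127.0.0.1}"
  "$CRAWLER" \
    --serve-markdown="$dir" \
    --serve-port="$port" \
    --serve-bind-address="$bind"
}

# Usage:
#   serve_markdown ./tmp/mydomain.tld.md               # localhost only, port 8321
#   serve_markdown ./tmp/mydomain.tld.md 9000 0.0.0.0  # reachable from the network
```

Keeping 127.0.0.1 as the default mirrors the tool's own behavior, so exposing the export on the network has to be requested explicitly.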
💡Further development ideas
If you have ideas on how to improve the Website to Markdown Converter, don’t hesitate to send a feature request (for the desktop application or for the command-line interface) with your suggestion. We are happy to consider and implement it if it will benefit more users.