How to Crawl a Website for LLM Training Data
How to crawl a website for LLM training data the clean way: discover URLs, render pages, strip boilerplate, and export tidy markdown that is ready to chunk, embed and train on. Public and permitted data only.
By the ClawEngine team
June 2026 · 10 min read
How to crawl a website for LLM training data without drowning in messy HTML
Crawling a website for LLM training data sounds simple until you do it. You fetch a page and get back a wall of navigation menus, cookie banners, footers, ad slots and inline scripts, and somewhere inside that noise is the actual content you wanted. Multiply that by ten thousand pages and the cleanup becomes the whole project. The goal of a good crawl is not just to download HTML. It is to produce clean, consistent, deduplicated text that is ready to chunk, embed and train on.
This guide walks through the full pipeline: discovering the right URLs, rendering each page, stripping boilerplate down to the real content, and exporting tidy markdown. It assumes you are working with public and permitted data only, which is the only kind of data you should be training on.
Start with permission, not with code
Before you crawl a single page, decide whether you are allowed to. Stick to content that is public and that you have permission to use: your own sites, sites that explicitly allow crawling, public documentation, open datasets, and sources whose Terms of Service permit your use. Read the site's robots.txt, respect any disallowed paths, and honor any Crawl-delay directive so you are not hammering a server. If a page sits behind a login, a paywall or any other access control, it is off limits. Compliance is not a footnote here. It is step one, and it protects both the sites you crawl and the model you are building.
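The robots.txt check can be done with Python's standard library. This is a minimal sketch, assuming the robots.txt text has already been downloaded; the user agent name and the sample rules are hypothetical and only illustrate the calls.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-training-crawler"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_delay_seconds(parser: RobotFileParser, default: float = 1.0) -> float:
    """Return the Crawl-delay that applies to our agent, or a polite default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


# Illustrative robots.txt content, not taken from a real site.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = load_rules(SAMPLE)
print(rules.can_fetch(USER_AGENT, "https://docs.example.com/guide/"))     # True
print(rules.can_fetch(USER_AGENT, "https://docs.example.com/private/x"))  # False
print(crawl_delay_seconds(rules))                                         # 2.0
```

Run a check like this before every fetch and sleep for the returned delay between requests to the same host. It covers robots.txt only; Terms of Service still need a human read.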
Step 1: Discover the URLs worth crawling
A crawl is only as good as its URL frontier. Begin with the site's sitemap.xml, which usually lists every canonical page the owner wants indexed. From there you can expand by following internal links, but set sensible boundaries: stay on-domain, cap crawl depth, skip query-string duplicates, and exclude paths you do not want (search results, tag pages, infinite calendars). Maintain a visited set so you never fetch the same URL twice.
# start a managed crawl from a sitemap, markdown output
curl https://api.clawengine.ai/v1/crawl \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://docs.example.com","limit":5000,"format":"markdown"}'
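If you build the frontier yourself, the boundaries described in Step 1 can be sketched with the standard library. This assumes the sitemap XML is already downloaded; the excluded path prefixes are hypothetical, and dropping the whole query string is a deliberately blunt way to collapse query-string duplicates that may be too aggressive for sites that use query parameters for real pages.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
EXCLUDED_PREFIXES = ("/search", "/tag/", "/calendar")  # hypothetical paths to skip


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> entry from sitemap XML that is already downloaded."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def canonicalize(url: str) -> str:
    """Drop the query string and fragment so duplicates collapse to one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", "", ""))


def build_frontier(urls: list[str], domain: str) -> list[str]:
    """Keep on-domain, non-excluded URLs, each at most once, in first-seen order."""
    visited: set[str] = set()
    frontier: list[str] = []
    for url in urls:
        key = canonicalize(url)
        parts = urlsplit(key)
        if parts.netloc != domain:
            continue  # stay on-domain
        if parts.path.startswith(EXCLUDED_PREFIXES):
            continue  # skip paths we never want
        if key in visited:
            continue  # never queue the same URL twice
        visited.add(key)
        frontier.append(key)
    return frontier
```

The same `build_frontier` filter can be applied to links discovered while following internal links; a depth cap would be tracked alongside each URL in that case and is not shown here.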
Step 2: Render the page so the content actually exists
Many modern sites ship a near-empty HTML shell and build the page in the browser with JavaScript. If you only fetch the raw response, you get an empty div and nothing to train on. To capture the real content you need to render the page in a headless browser, wait for the network to settle, and then read the resulting DOM. Doing this yourself means running and scaling a browser fleet. A web scraping API handles rendering for you so each page comes back fully populated.
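Rendering is expensive, so it can help to detect which pages actually need it. The sketch below is a rough heuristic, not a rendering solution: it counts the visible words in a raw HTML response and flags the page as a likely JavaScript shell when there are very few. The 50-word threshold is an arbitrary assumption to tune against your own sites.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside script, style, noscript and template elements."""

    HIDDEN = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth and data.strip():
            self.chunks.append(data.strip())


def needs_rendering(raw_html: str, min_words: int = 50) -> bool:
    """Guess whether a raw response is an empty shell that a browser must fill in."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    words = sum(len(chunk.split()) for chunk in parser.chunks)
    return words < min_words
```

Pages that pass the check can be processed from the raw response; pages that fail it go to a headless browser or a rendering API.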
Step 3: Strip boilerplate down to real content
This is where most training corpora are won or lost. Navigation, headers, footers, share buttons, related-posts widgets and cookie notices repeat on every page and add nothing but noise and duplication. Extract the main content region, drop the chrome, and keep the meaningful structure: headings, paragraphs, lists, tables and code blocks. Clean markdown is an excellent target format because it preserves that structure in a compact, token-efficient way that chunkers and embedders handle well.
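# python: collect rendered markdown for each crawled page
import os
import requests

KEY = os.environ["KEY"]  # same API key the curl example reads from $KEY

def save(url, markdown):
    """Placeholder storage: replace with a write to your own document store."""
    print(url, len(markdown))

resp = requests.post(
    "https://api.clawengine.ai/v1/crawl",
    headers={"Authorization": f"Bearer {KEY}"},
    json={"url": "https://docs.example.com", "limit": 5000, "format": "markdown"},
)
resp.raise_for_status()  # stop on auth or request errors instead of parsing an error body
for page in resp.json()["pages"]:
    save(page["url"], page["markdown"])  # clean, boilerplate-stripped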
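To make the idea of dropping the chrome concrete, here is a small standard-library sketch of boilerplate stripping. It is an illustration of the technique under simple assumptions, not how ClawEngine does it: it skips a fixed set of boilerplate elements and keeps only headings, paragraphs and list items. Tables, code blocks and inline formatting are left out to keep it short, and a title placed inside a `<header>` element would be dropped along with the rest of that element.

```python
from html.parser import HTMLParser

SKIP = {"nav", "header", "footer", "aside", "script", "style", "form"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### ", "h4": "#### "}
BLOCKS = set(HEADINGS) | {"p", "li"}


class MainContentToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.prefix = None   # markdown prefix of the open block, if any
        self.buffer = []     # text pieces of the open block
        self.blocks = []     # finished markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif not self.skip_depth and tag in BLOCKS:
            self._flush()
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif not self.skip_depth and tag in BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth and self.prefix is not None:
            self.buffer.append(data)

    def _flush(self):
        """Close the open block, collapsing its whitespace into single spaces."""
        text = " ".join("".join(self.buffer).split())
        if self.prefix is not None and text:
            self.blocks.append(self.prefix + text)
        self.prefix, self.buffer = None, []


def html_to_markdown(html: str) -> str:
    parser = MainContentToMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)
```

Tag-based skipping is the simplest possible rule. Real pages often mark their chrome with class names instead of semantic tags, which is where a dedicated extraction library or a managed API earns its keep.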
Step 4: Deduplicate, normalize and chunk
Even a clean crawl produces duplicates and near-duplicates: paginated lists, printer-friendly variants, the same article under two URLs. Normalize whitespace and encoding first, then hash the cleaned text to drop exact repeats, and consider a fuzzy method such as MinHash or SimHash for pages that differ only slightly, so your model does not over-weight content that simply appears more often. Keep a record of each document's source URL and crawl date for provenance, then chunk by heading or by a fixed token window with a little overlap. Consistent chunks make for consistent embeddings.
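A minimal sketch of this step, with two stated simplifications: the hash only catches documents that are identical after normalization (true near-duplicates need a fuzzy method), and the chunker counts whitespace-separated words as a stand-in for tokens. The default window and overlap sizes are arbitrary assumptions.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC, unify newlines, and collapse runs of spaces and blank lines."""
    text = unicodedata.normalize("NFC", text).replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text, used as an exact-duplicate key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first URL seen for each distinct normalized text."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, text in docs.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            unique[url] = normalize(text)
    return unique


def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Because `chunk` splits on whitespace it flattens line breaks, so for markdown you would usually split by heading first and apply the window only to sections that are still too long. Swap the word count for your tokenizer's count when the window must match a model's limit.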
Step 5: Keep provenance and respect removal requests
Store the source URL, the fetch timestamp and the robots status alongside every document. Provenance lets you honor a later request to remove a source, audit where a model learned something, and re-crawl only what changed. Treat your corpus as a living thing that respects the wishes of the sites it came from.
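One way to keep those three fields attached to every document is a small record type. This is a sketch with hypothetical field and function names; in practice the records would live in a database rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Document:
    source_url: str        # where the text came from
    fetched_at: datetime   # when it was crawled (UTC)
    robots_allowed: bool   # robots.txt status at fetch time
    text: str              # cleaned markdown


def remove_source(corpus: list[Document], domain: str) -> list[Document]:
    """Return the corpus without any document fetched from `domain`."""
    return [doc for doc in corpus if urlsplit(doc.source_url).hostname != domain]


corpus = [
    Document("https://docs.example.com/a", datetime(2026, 6, 1, tzinfo=timezone.utc), True, "A"),
    Document("https://blog.example.org/b", datetime(2026, 6, 2, tzinfo=timezone.utc), True, "B"),
]
corpus = remove_source(corpus, "blog.example.org")
print([doc.source_url for doc in corpus])  # ['https://docs.example.com/a']
```

Because every chunk can inherit its parent document's `source_url`, a removal request can be traced all the way through to the embeddings built from that source.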
Let the crawl be the easy part
The hard parts of building LLM training data are rendering, cleanup, deduplication and staying compliant, not the act of downloading a page. ClawEngine crawls public and permitted sites, renders JavaScript, strips boilerplate, and hands back clean markdown or JSON ready to chunk and embed, while respecting robots.txt and Terms of Service the whole way. See how it works or read our robots.txt and compliance guide before you start.