Features
Structured data extraction, built into the web scraping API
Define a schema, render the JavaScript, crawl at scale and get clean markdown or typed JSON back from one API. Everything you need to turn public pages into LLM-ready data, without running a scraper fleet.
Markdown & JSON · JavaScript rendered · robots.txt respected
What it does
One API for every scraping job
From a single page to a scheduled crawl of a whole site, these are the building blocks you get with ClawEngine.
Structured data extraction
Define a schema once and get typed records back from any public page. Names, prices, ratings, dates, authors: whatever fields you declare come back as clean, typed JSON, not as a brittle pile of selectors you have to maintain.
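To make "typed records" concrete, here is a minimal Python sketch of loading one extracted result into a typed record. The `Product` class and `parse_product` helper are illustrative names, not part of any ClawEngine SDK, and the sample body simply mirrors the name, price and rating fields used in the product example on this page.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Hypothetical record type matching a name/price/rating schema."""
    name: str
    price: float
    rating: float


def parse_product(body: str) -> Product:
    """Parse a JSON response body into a Product.

    Raises KeyError if a declared field is missing.
    """
    data = json.loads(body)
    return Product(
        name=str(data["name"]),
        # float() accepts JSON integers such as 24 as well as 24.0
        price=float(data["price"]),
        rating=float(data["rating"]),
    )


# Sample body shaped like the response example on this page.
sample = '{ "name": "Atlas Notebook", "price": 24.00, "rating": 4.7 }'
product = parse_product(sample)
```

Downstream code can then read `product.price` as a number directly, instead of re-parsing strings scraped from markup.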
JavaScript rendering
Every page loads in a managed headless browser before extraction, so single-page apps, dynamic tables and infinite-scroll content come through fully loaded. No empty shells, no missing data from client-rendered sites.
Crawl at scale
Point ClawEngine at a domain and crawl thousands of pages on a schedule. Concurrency, retries, proxy handling and rate limits are managed for you, so you collect clean data without running a fleet.
Markdown and JSON output
Get back clean markdown with the boilerplate stripped, or structured JSON with title, links and metadata. The same shape every time, ready to chunk, embed and feed straight into a model.
Schema extraction
Send a JSON schema with your request and ClawEngine maps the page to it. Typed fields, nested objects and arrays come back validated, so your downstream code can trust the structure.
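If you want to double-check a result on your side, the idea of a validated, nested structure can be sketched in a few lines of Python. This is only an illustration: the type names "string" and "number" come from the example on this page, while "boolean", nested dicts for objects and one-element lists for arrays are assumptions about the notation, and `conforms` is a hypothetical helper rather than a ClawEngine function.

```python
from typing import Any

# "string" and "number" appear in the page's example; "boolean" is assumed.
_SCALARS = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def conforms(value: Any, schema: Any) -> bool:
    """Return True if value matches the (assumed) schema notation."""
    if isinstance(schema, str):
        expected = _SCALARS.get(schema)
        if expected is None:
            return False  # unknown type name
        # bool is a subclass of int in Python, so reject it for "number"
        if schema == "number" and isinstance(value, bool):
            return False
        return isinstance(value, expected)
    if isinstance(schema, dict):
        # Nested object: every declared field must be present and valid.
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub)
            for key, sub in schema.items()
        )
    if isinstance(schema, list) and len(schema) == 1:
        # Array: every element must match the single item schema.
        return isinstance(value, list) and all(
            conforms(item, schema[0]) for item in value
        )
    return False


schema = {
    "name": "string",
    "price": "number",
    "tags": ["string"],
    "seller": {"name": "string", "verified": "boolean"},
}
record = {
    "name": "Atlas Notebook",
    "price": 24.0,
    "tags": ["paper", "a5"],
    "seller": {"name": "Atlas Co", "verified": True},
}
is_valid = conforms(record, schema)
```

Extra fields in a record are tolerated here; a stricter check could reject them.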
RAG and agent integrations
LLM-ready output drops into LangChain and LlamaIndex pipelines and into your own agents. Clean chunks can go straight to a vector store, with no separate cleanup step in between.
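As a rough illustration of the chunking step before a vector store, the Python sketch below splits markdown at headings and caps chunk length. It uses only the standard library and makes no assumption about LangChain or LlamaIndex APIs; `chunk_markdown` and its `max_chars` parameter are hypothetical names chosen for this example.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split markdown into chunks at headings, then cap each chunk's length."""
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        text = section.strip()
        if not text:
            continue
        # Hard-split oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks


page = "# Atlas Notebook\nA5, dotted pages.\n\n## Shipping\nShips in two days.\n"
chunks = chunk_markdown(page)
```

Each chunk keeps its heading, which tends to help retrieval. Note that this simple split also treats a `#` line inside a code block as a heading, so real pipelines may want a proper markdown parser.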
SDKs and a simple REST API
Call the REST API with curl, or use the Python and Node SDKs. One endpoint to extract a page, one to crawl a site, and one to poll a crawl, with clear, consistent responses.
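The poll step can be sketched as a small loop. The endpoint paths and response fields are not given on this page, so this Python example takes the status call as a function you supply, and the status values "completed" and "failed" are assumptions for illustration only.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the crawl reports a terminal state.

    fetch_status is whatever function wraps your poll request;
    it should return the decoded JSON status as a dict.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        # "completed" and "failed" are hypothetical status values.
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(interval_s)
    raise TimeoutError(f"crawl not finished after {max_attempts} polls")


# Usage with a stubbed status function instead of a real HTTP call.
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "pages": 3},
])
result = wait_for_crawl(lambda: next(_responses), sleep=lambda _s: None)
```

Passing `sleep` as a parameter keeps the loop easy to test; in production you would leave the default and pass a function that performs the real poll request.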
Webhooks for crawls
Kick off a large crawl and get the results delivered to your endpoint as they complete. No long-held connections to manage, just a webhook that fires when each batch of pages is ready.
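On the receiving side, a webhook endpoint can be very small. The Python sketch below uses only the standard library; the payload shape (a JSON object with a "pages" list whose items carry a "url") is an assumption made for this example, not a documented ClawEngine format, and `pages_from_batch` and `CrawlWebhookHandler` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Any, Dict, List


def pages_from_batch(body: bytes) -> List[Dict[str, Any]]:
    """Extract page records from a webhook body (assumed "pages" key)."""
    payload = json.loads(body)  # raises ValueError on invalid JSON
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    pages = payload.get("pages", [])
    if not isinstance(pages, list):
        raise ValueError("expected 'pages' to be a list")
    # Skip anything that is not an object so callers can rely on dicts.
    return [page for page in pages if isinstance(page, dict)]


class CrawlWebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed batches and acknowledges them immediately."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            pages = pages_from_batch(self.rfile.read(length))
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        for page in pages:
            # Replace with real storage; printing keeps the sketch small.
            print(page.get("url"))
        # 204: received, nothing to return.
        self.send_response(204)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8000), CrawlWebhookHandler).serve_forever()`. A production endpoint would also verify that each request really comes from the crawler before trusting it.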
Structured data extraction
Define a schema, get typed records back
The headline feature. Instead of writing CSS selectors that break every time a page changes, you declare the fields you want and ClawEngine returns them as validated, typed JSON from any public page.
- Declare fields once and reuse the schema across thousands of pages
- Typed values: strings, numbers, booleans, dates, nested objects and arrays
- No brittle selectors to maintain when a site changes its markup
- Returned alongside clean markdown, so you keep the prose and the fields
- A compliance line on every result: robots.txt respected, public data only
# request a typed schema
{
  "url": "example.com/products/atlas",
  "schema": {
    "name": "string",
    "price": "number",
    "rating": "number"
  }
}
# response
{
  "name": "Atlas Notebook",
  "price": 24.00,
  "rating": 4.7
}
Everything you need to turn the web into data
Structured extraction, JavaScript rendering, crawl at scale and LLM-ready output, in one API. Public, permitted data only.