Compliance

Web scraping compliance, built in by default

ClawEngine is for public and permitted data only. It respects robots.txt and site Terms of Service, honors crawl-delay, and rate-limits its requests. Compliance is a feature here, not fine print, so your data pipeline stays defensible.

Public and permitted data only

ClawEngine crawls public web pages and data you are permitted to access. It is not built to reach anything behind a login, a paywall or any other access control.

robots.txt and ToS respected

On every request, ClawEngine reads the site's robots.txt before crawling and respects what the site allows. We honor site Terms of Service as part of responsible, defensible crawling.

Crawl-delay honored

When a site asks crawlers to slow down, ClawEngine listens. We honor crawl-delay and rate-limit politely so we do not overload the sites we visit.
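To make the idea concrete, here is a minimal sketch of how a crawler can read a crawl-delay directive, using Python's standard urllib.robotparser. It is an illustration only and not ClawEngine's internal code: the robots.txt content, the ExampleBot user agent, and the one-second fallback are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# Assumed fallback pause between requests; not a ClawEngine setting.
DEFAULT_DELAY_SECONDS = 1.0


def polite_delay(robots_txt: str, user_agent: str) -> float:
    """Return how many seconds to wait between requests to one site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        return DEFAULT_DELAY_SECONDS
    # Never go faster than the fallback, even if the site asks for less.
    return max(float(delay), DEFAULT_DELAY_SECONDS)


if __name__ == "__main__":
    print(polite_delay(ROBOTS_TXT, "ExampleBot"))
```

A crawler built this way would sleep for the returned number of seconds between consecutive requests to the same host.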

What ClawEngine is built for

Public documentation and knowledge bases
Public product catalogs and listings
Public articles, news and reference pages
Your own websites and content
Sites you have explicit permission to crawl

What it is not for

Reaching content behind authentication or logins
Getting past paywalls or subscription walls
Evading bot detection or access controls
Collecting private or personal data
Anything a site's robots.txt or Terms of Service disallow

You are responsible for what you crawl

ClawEngine gives you the tools to crawl and extract public data responsibly, but you decide which URLs to point it at. You are responsible for ensuring that you have the right to crawl those pages and use the data, and that doing so is lawful where you operate. That includes following each site's Terms of Service, respecting intellectual property, and complying with data protection and privacy laws that apply to you and to anyone whose data might appear on a public page.

Before you start a crawl, make sure the data is public and permitted, that you are not collecting personal data you have no basis to process, and that your use of the output is consistent with the source site's terms. When in doubt, crawl your own sites or sites that have given you permission.

How ClawEngine enforces this

Compliance is wired into the engine, not left to good intentions. On every request, ClawEngine reads robots.txt first and respects what the site allows for crawlers. We honor crawl-delay and rate-limit politely so we do not overload a site. We do not provide features designed to bypass authentication, defeat paywalls or evade bot detection, and our acceptable use policy prohibits using ClawEngine for any of those purposes. Each result carries a clear signal that robots.txt was respected and the data was public.
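As a rough illustration of a robots.txt check that runs before any fetch, the sketch below uses Python's standard urllib.robotparser. It is not ClawEngine's implementation: the CrawlDecision record, its allowed field, the example rules, and the ExampleBot user agent are hypothetical names chosen for this example, standing in for whatever per-result signal the engine actually emits.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


@dataclass
class CrawlDecision:
    """Hypothetical record pairing a URL with its robots.txt verdict."""
    url: str
    allowed: bool


def check_url(robots_txt: str, user_agent: str, url: str) -> CrawlDecision:
    """Decide whether a URL may be fetched under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() matches the URL's path against the rules for this agent.
    return CrawlDecision(url=url, allowed=parser.can_fetch(user_agent, url))


if __name__ == "__main__":
    for target in ("https://example.com/docs/", "https://example.com/private/x"):
        decision = check_url(EXAMPLE_RULES, "ExampleBot", target)
        print(decision.url, "allowed" if decision.allowed else "skipped")
```

Keeping the verdict next to the URL means a disallowed page is skipped before any request is sent, and the decision can travel with the result for later review.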

Data handling

The data we crawl on your instructions is processed to produce your result and is handled in line with our privacy policy. We do not sell it, and we do not use the public data we crawl for you to train public or third-party models. The output is yours. You can ask us to delete the requests and data you have submitted at any time, and you control what happens to the results once they reach your systems.

If something looks wrong

If you operate a site and believe ClawEngine has crawled it in a way that does not respect your robots.txt or Terms of Service, contact us and we will look into it promptly. We want responsible crawling to be the default for everyone, the sites we visit included.

In short. Public, permitted data only. robots.txt and Terms of Service respected. Crawl-delay honored. You are responsible for what you crawl. Questions? Email team@clawengine.ai.

Compliant web scraping, by default

Turn public pages into clean, LLM-ready data on an engine that respects robots.txt and Terms of Service out of the box. Public, permitted data only.
