Is Web Scraping Legal? A robots.txt and Compliance Guide
Is web scraping legal? A practical, compliance-first guide to robots.txt, Terms of Service, crawl-delay, and public versus private data, so you can crawl responsibly and stay on the right side of the rules.
By the ClawEngine team
June 2026 · 11 min read
Is web scraping legal? It depends on what you crawl and how you behave
"Is web scraping legal?" is one of the most common questions developers ask, and the honest answer is that it depends. Scraping is not inherently illegal, and accessing public information at reasonable rates is widely accepted. But the moment you touch private data, ignore a site's stated wishes, or overload a server, you move into territory that can be both legally and ethically wrong. This guide explains the compliance-first way to crawl: public and permitted data only, robots.txt and Terms of Service respected, crawl-delay honored, and you held accountable for what you collect. It is general guidance, not legal advice; consult a lawyer for your specific situation.
Public and permitted data is the line that matters
The single most important distinction in responsible scraping is public versus private. Public data is content a site openly displays to any visitor without a login: published articles, product catalogs, documentation, public listings. Permitted data is content you are explicitly allowed to use, including your own sites and sites whose Terms of Service or an agreement grant you access. Everything else, content behind authentication, paywalls, or access controls, is off limits. ClawEngine is built for public and permitted data only, and that is the right policy for any crawl.
What robots.txt is and why you honor it
robots.txt is a file at the root of a site, for example at example.com/robots.txt, where the owner tells automated agents which paths they may and may not crawl. It is a clear, machine-readable statement of the site's wishes.
# a typical robots.txt
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
The right thing to do is simple: read robots.txt before you crawl, stay out of disallowed paths, and treat the file as binding. Respecting it keeps you a good citizen of the web and reduces the chance of disputes. A compliance-first crawler checks robots.txt automatically and skips anything disallowed.
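As a rough illustration of that check, the sketch below uses Python's standard-library urllib.robotparser against the sample file shown earlier. The user-agent name example-crawler, the helper names, and the candidate URLs are assumptions made for this example; it is not a description of how any particular crawler implements the check.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from this guide, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "example-crawler"  # hypothetical name, chosen for illustration


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls, user_agent: str = USER_AGENT):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/products/1",
    "https://example.com/admin/users",
    "https://example.com/checkout/cart",
]
print(allowed_urls(parser, candidates))
print(parser.crawl_delay(USER_AGENT))
```

In a live crawl you would typically call the parser's set_url() and read() methods to download the file from the site instead of inlining it.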
Honor crawl-delay and crawl gently
The Crawl-delay directive, a widely used but non-standard extension to robots.txt that some crawlers ignore, asks you to wait a set number of seconds between requests so you do not overload the server. Honor it when it is present. Even when no delay is specified, crawl politely: limit concurrency, spread requests over time, and back off when a site returns errors or slows down. Hammering a host can degrade service for real users and is exactly the kind of behavior that turns a routine crawl into a problem. Gentle, rate-limited crawling is both kinder and safer.
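One way to picture this is a sequential loop that waits between requests and doubles its wait when the server signals trouble. The sketch below is a minimal example under stated assumptions: the function crawl_politely, its parameters, and the choice of HTTP 429 and 5xx responses as back-off signals are illustrative, and fetch is a caller-supplied function that returns a status code rather than any particular HTTP library.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], int],
    crawl_delay: float = 5.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Fetch URLs one at a time, pausing between requests.

    A 429 or 5xx status is treated as a sign of server strain (an
    assumption of this sketch): the wait doubles before each retry.
    """
    results: Dict[str, int] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(crawl_delay)  # honor the delay between consecutive requests
        wait = crawl_delay
        status = fetch(url)
        retries = 0
        while (status == 429 or status >= 500) and retries < max_retries:
            wait *= 2  # exponential back-off
            sleep(wait)
            status = fetch(url)
            retries += 1
        results[url] = status
    return results


# Demo with a stubbed fetch and a recorded sleep, so nothing touches the network.
stub_statuses = iter([200, 503, 200])
recorded_waits = []
demo = crawl_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: next(stub_statuses),
    sleep=recorded_waits.append,
)
print(demo)
print(recorded_waits)
```

Passing sleep and fetch in as parameters keeps the loop easy to test; a real crawler would supply its own HTTP client and could take the delay from the site's robots.txt.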
Read the Terms of Service
A site's Terms of Service may permit, restrict or forbid automated access, and they carry weight. Before crawling at scale, check the ToS and respect what they say. When terms forbid scraping, the responsible choice is to look for an official API, request permission, or find another source. Aligning your crawl with both robots.txt and the ToS is how you stay on the right side of the rules.
Personal and private data deserves extra care
Personal data, meaning anything that identifies an individual, can fall under privacy laws such as the GDPR and the CCPA even when it appears on a public page; the exact scope varies by law and jurisdiction. Collecting and processing it carries real obligations. The safe default is to avoid scraping personal data, and to seek proper legal guidance before going anywhere near it. Just because information is visible does not mean it is free to harvest.
Practices that keep you compliant
- Crawl public and permitted data only. If it needs a login or sits behind a paywall, do not crawl it.
- Respect robots.txt and Terms of Service. Treat both as binding statements of the owner's wishes.
- Honor crawl-delay and rate limits. Never overload a server.
- Keep provenance. Record source URLs and dates so you can honor removal requests.
- Avoid personal data. Get legal advice before processing anything that identifies individuals.
- You are responsible for what you crawl. The tool follows the rules; the decision of what to collect is yours.
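The provenance practice in the list above can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: the ProvenanceRecord class, its two fields, and the drop_source helper are hypothetical names, and a real pipeline would also need to delete the stored content that each record points to.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    source_url: str  # where the content came from
    fetched_at: str  # ISO 8601 timestamp of the fetch, in UTC


def make_record(source_url: str, fetched_at: Optional[datetime] = None) -> ProvenanceRecord:
    """Create a record, defaulting the timestamp to the current UTC time."""
    moment = fetched_at or datetime.now(timezone.utc)
    return ProvenanceRecord(source_url, moment.isoformat())


def drop_source(records: List[ProvenanceRecord], url_prefix: str) -> List[ProvenanceRecord]:
    """Return the records that remain after a removal request for url_prefix."""
    return [r for r in records if not r.source_url.startswith(url_prefix)]


records = [
    make_record("https://example.com/articles/1"),
    make_record("https://example.org/docs/intro"),
]
# A removal request arrives for everything collected from example.com.
records = drop_source(records, "https://example.com/")
print([r.source_url for r in records])
```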
Compliance-first by design
The way to scrape with confidence is to make compliance the default, not an afterthought. ClawEngine is built for public and permitted data only: it respects robots.txt and Terms of Service, honors crawl-delay, and never aims at authentication, paywalls or private data. The responsibility for what you choose to crawl stays with you, and good defaults make doing the right thing the easy path. Read our compliance policy or see how it works.
See ClawEngine turn pages into clean data
Point ClawEngine at any public or permitted site and get back clean markdown, JSON, or typed structured fields in one call. Crawl at scale, render JavaScript, and feed your RAG pipelines and AI agents, all while respecting robots.txt and Terms of Service.