Stage 4 Research Report: Web Scraping Approaches vs Industry Best Practice
Document under review: _posts/2021-11-09-re-scrape.md
Date: 2026-02-24
A. Current Scraping Tool Landscape
Tier 1: HTTP-level scrapers (no browser rendering)
- Scrapy (Python): The production workhorse, powering an estimated 34% of production scraping projects. Best for high-volume extraction of static/server-rendered content.
- https://scrapy.org/
- Ruby Mechanize / Nokogiri: Still functional for simple static sites, but Mechanize cannot execute JavaScript and its HTTP patterns are easily fingerprinted by modern bot detection. Considered a legacy choice for well-defended targets.
- https://github.com/sparklemotion/mechanize
- Requests + BeautifulSoup (Python): Same tier — HTTP-only, no JS rendering.
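The Tier 1 pattern is simply "fetch HTML, parse, extract" with no JavaScript execution. A stdlib-only Python sketch (parsing a canned page rather than fetching over the network, to keep it self-contained; the HTML and class names are illustrative):

```python
# Minimal Tier 1 extraction: parse server-rendered HTML with no JS execution.
# A canned page stands in for the body of an HTTP response.
from html.parser import HTMLParser

HTML = """<html><body>
<h2 class="title">Post One</h2>
<h2 class="title">Post Two</h2>
</body></html>"""

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

parser = TitleExtractor()
parser.feed(HTML)
print(parser.titles)  # ['Post One', 'Post Two']
```

Requests + BeautifulSoup (or Mechanize + Nokogiri in Ruby) replace the hand-rolled parser with nicer selectors, but the capability ceiling is the same: whatever the server renders into the initial HTML.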
Tier 2: Browser automation (full JS rendering)
- Playwright (Microsoft): Now the consensus recommendation for new projects. Multi-browser support (Chromium, Firefox, WebKit), unified API, auto-waiting. Has overtaken Puppeteer.
- https://playwright.dev/
- Puppeteer (Google): Still relevant for Node.js projects with existing infrastructure, but Chromium-only.
- https://pptr.dev/
- Selenium: Mature but slower and more verbose; mostly found in legacy codebases and cross-browser test suites.
Tier 3: Managed scraping services and APIs
- Scraping APIs: ScraperAPI, ZenRows, ScrapingBee, Scrapfly, Crawlbase. These handle proxy rotation, CAPTCHA solving, and stealth. You send a URL, they return rendered HTML.
- Data-as-a-Service platforms: Apify (pre-built “Actors”), Zyte, Bright Data. Higher-level — buy datasets or run maintained scraper templates.
- Managed browser platforms: Browserbase, Browserless. Remote headless browsers with stealth and proxy management built in.
Tier 4: Passive/intercepting approaches
- mitmproxy: Man-in-the-middle proxy that intercepts and records HTTP/HTTPS traffic. Does not initiate requests — records traffic from a real browser session, sidestepping bot detection entirely because requests come from a genuine browser with genuine user behaviour.
- https://mitmproxy.org/
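A recording addon for mitmproxy can be a plain class with a `response` hook, which mitmproxy calls once per completed HTTP response. A minimal sketch: the hook name and the `flow.request.pretty_url` / `flow.response.text` / `flow.response.status_code` attributes are mitmproxy's addon API; the output file name is illustrative.

```python
# A minimal mitmproxy addon: record each URL and response body while a human
# browses through the proxy.  Run with:  mitmproxy -s record.py
import json

class Recorder:
    def __init__(self, path="capture.jsonl"):
        self.path = path

    def response(self, flow):
        # mitmproxy invokes this hook once per completed HTTP response.
        record = {
            "url": flow.request.pretty_url,
            "status": flow.response.status_code,
            "body": flow.response.text,
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

addons = [Recorder()]
```

Because the traffic originates from a real browser driven by a real person, nothing in this pipeline is visible to bot detection; the addon only observes.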
Mapping the blog post’s approaches
| Blog post approach | Industry tier | Status in 2025-2026 |
|---|---|---|
| Ruby Mechanize (HTTP-level) | Tier 1 | Dead on arrival against Cloudflare/DataDome. Trivially fingerprinted via TLS. |
| Puppeteer/Apify (headless browser) | Tier 2 | Detectable without stealth hardening. Vanilla Puppeteer blocked by CDP side effects, navigator.webdriver, behavioural analysis. |
| mitmproxy (passive recording) | Tier 4 | Correct — avoids all detection. Tradeoff: no automation, requires human browsing. |
| (Not covered) Managed services | Tier 3 | Where most production scraping of well-defended sites happens. Proxy rotation, CAPTCHA solving, stealth at scale. |
B. Anti-Scraping Defences
Rate limiting and IP reputation
HTTP 429 (Too Many Requests) is the first line of defence. Standard mitigation: exponential backoff with jitter, respecting Retry-After headers, proxy rotation across residential/mobile IP pools. IP reputation databases track datacenter IPs, known proxy ranges, and ASN classifications.
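The standard retry policy can be captured in one pure function, a sketch assuming "full jitter" backoff (the base and cap values are illustrative defaults):

```python
# Exponential backoff with full jitter, honouring a Retry-After header when
# the server sends one.  Pure function, so the policy is easy to test.
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server said exactly how long to wait; respect it.
        return float(retry_after)
    # Full jitter: uniform between 0 and the capped exponential bound.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller sleeps for `backoff_delay(n, retry_after)` after each 429 before retrying; if blocks persist across retries, that is usually an IP-reputation problem, not a rate problem, and the answer is proxy rotation rather than longer waits.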
TLS fingerprinting
Servers analyse the TLS Client Hello message (cipher suites, extensions, and their ordering) to identify the client library. Scrapy, Requests, and Node.js's `http`/`https` modules produce TLS fingerprints that differ from a real browser's. Tools like curl-impersonate or tls-client exist to mimic browser TLS fingerprints.
- https://www.browserless.io/blog/tls-fingerprinting-explanation-detection-and-bypassing-it-in-playwright-and-puppeteer
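One input to fingerprints like JA3 is the ordered cipher-suite list the client offers, and you can inspect your own client's list locally. A sketch using Python's stdlib `ssl` module: the list a default Python context offers is not the list Chrome or Firefox sends, which is exactly the discrepancy servers key on.

```python
# Inspect the cipher suites a default Python TLS context would offer.
# This ordered list (plus extensions) is what TLS fingerprinting hashes.
import ssl

ctx = ssl.create_default_context()
suites = [c["name"] for c in ctx.get_ciphers()]
print(len(suites), suites[:3])
```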
JavaScript challenges and browser fingerprinting
- Cloudflare Turnstile: Cryptographic challenges verifying browser environment integrity.
- https://blog.cloudflare.com/per-customer-bot-defenses/
- DataDome: Deep device fingerprinting — maps declared specs against actual hardware (e.g. a User-Agent claiming Android but reporting 64 CPU cores gets blocked). Tracks mouse movements, scroll patterns, typing cadence.
- Detection vectors: `navigator.webdriver` flag, `HeadlessChrome` in the User-Agent, missing browser plugins, Canvas/WebGL rendering differences, Chrome DevTools Protocol (CDP) side effects, Playwright-injected globals.
Cloudflare AI Labyrinth (March 2025)
When bot activity is detected, invisible links are injected into pages leading to AI-generated decoy content. Any visitor following these links 4+ levels deep is almost certainly a bot. Available free to all Cloudflare customers.
- https://blog.cloudflare.com/ai-labyrinth/
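The crawler-side defence against decoy mazes is simply to bound traversal depth from each seed page. A sketch over an in-memory link graph (the graph format and depth cap are illustrative; a real crawler would also whitelist URL patterns):

```python
# AI Labyrinth catches crawlers that blindly follow injected links several
# levels deep.  A cautious crawler caps how far it wanders from a seed page.
from collections import deque

def crawl(start, links, max_depth=3):
    """Breadth-first walk of `links` (url -> list of urls), depth-capped."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # never follow links 4+ levels from a seed
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With `max_depth=3`, a decoy chain injected below a real page is abandoned before the "4+ levels deep" threshold that marks a visitor as a bot.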
Headless browser detection — current state
The arms race is tilted toward detection in 2025-2026:
Detectable signals:
- `navigator.webdriver` flag (trivially patchable, but its absence is itself detectable)
- CDP instrumentation traces in the browser’s JavaScript environment
- Playwright-specific globals (`window.__playwright__binding__`)
- Missing browser plugins/extensions that real users typically have
- Canvas and WebGL rendering differences in headless mode
Evasion tools:
- `puppeteer-extra-plugin-stealth`: Patches common detection vectors. Increasingly insufficient against DataDome and Cloudflare.
- `fingerprint-suite`: Generates and injects realistic browser fingerprints.
- `nodriver` (Python): Successor to `undetected_chromedriver`; avoids CDP-based detection.
Bottom line: vanilla Puppeteer or Playwright against Cloudflare or DataDome will be detected and blocked. The blog post’s experience is completely typical.
- https://blog.castle.io/how-to-detect-headless-chrome-bots-instrumented-with-playwright/
- https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/
C. Legal and Ethical Frameworks
Key court precedents
- hiQ Labs v. LinkedIn (2022): Ninth Circuit ruled that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). Landmark case for scraping legality.
- Meta v. Bright Data (2023-2024): Scraping public profiles is allowed, but scraping content behind login walls or contractual restrictions may constitute breach.
Regulatory frameworks
- GDPR (EU): Scraping personal data without a proper legal basis can result in fines of up to €20 million or 4% of global annual turnover, whichever is higher. The French CNIL clarified in 2025 that even “public” web pages may contain personal data requiring GDPR safeguards.
- CCPA (California): Similar obligations for personal information of California residents.
- EU AI Act: Scraping for AI training purposes has additional compliance requirements.
robots.txt and Terms of Service
- `robots.txt` is not legally binding, but ignoring it demonstrates bad faith and weakens your legal position. Courts have cited robots.txt violations as evidence of intent to circumvent access controls.
- ToS violations alone may not create criminal liability, but they can support civil claims (breach of contract, trespass to chattels).
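Checking robots.txt before fetching costs almost nothing; Python's standard library parses it. A self-contained sketch (the rules and bot name are made up; a real scraper would point `set_url` at the live file):

```python
# Consult robots.txt rules with the stdlib parser.  The file content is
# supplied inline here so the example needs no network access.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # False
print(rp.can_fetch("MyBot/1.0", "https://example.com/public/y"))   # True
```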
Ethical consensus principles
- Respect `robots.txt` and rate limits
- Identify your scraper with a real User-Agent string
- Scrape only what you need — minimise server load
- Do not scrape personal data without legal basis
- Do not circumvent authentication or paywalls
- Be transparent about purpose and methods
- Cache aggressively to avoid redundant requests
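The last principle is mechanical to implement: key responses by a hash of the URL so re-runs never re-fetch pages they already hold. A minimal on-disk sketch (the cache directory layout and the injected `fetch` callable are illustrative):

```python
# "Cache aggressively": never hit the same URL twice across scraper runs.
import hashlib
import os

def cached_fetch(url, fetch, cache_dir="cache"):
    """Return the cached body for `url`, calling `fetch(url)` only on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    return body
```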
The mitmproxy approach and ethics
Passive recording of your own browsing session is ethically distinct from automated scraping. Requests originate from a real browser at human speed. The ethical consideration shifts from “am I overwhelming their servers” to “am I using the captured data in ways the site’s ToS permits.”
Sources:
- https://www.grepsr.com/blog/overview-web-scraping-legality/
- https://www.xbyte.io/the-future-of-web-scraping-compliance-navigating-gdpr-ccpa-and-ai-laws-in-2025/
- https://iswebscrapinglegal.com/blog/gdpr-ccpa-web-scraping/
D. Legitimate Alternatives to Scraping
- Official APIs: Always check for an API first — most reliable and legally clean.
- RSS/Atom feeds: Still available on many content sites.
- Data providers: Bright Data, Oxylabs, Zyte sell pre-collected datasets.
- Browser extensions: Instant Data Scraper, Data Miner for small-scale needs.
- Sitemaps: `sitemap.xml` provides a structured index of all pages.
- Direct data licensing: For commercial use at scale; eliminates legal risk entirely.
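Reading a sitemap is often enough to enumerate a site without crawling it at all. A stdlib sketch, parsing an inline sample (real sitemaps use the same `sitemaps.org` namespace; the URLs are placeholders):

```python
# Enumerate page URLs from a sitemap.xml document.
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/', 'https://example.com/about']
```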
Summary
The blog post’s escalation path (HTTP client → headless browser → passive proxy) accurately reflects the reality that well-defended sites can defeat both HTTP clients and headless browsers. The post is honest about what doesn’t work and why the passive approach is the only viable option without significant infrastructure investment. The main gap is the missing Tier 3 (managed scraping services), which is where most production scraping of well-defended sites actually happens. The legal and ethical landscape has also evolved significantly since 2021, with key court rulings and regulatory enforcement providing clearer frameworks. The post’s disclaimer section covers the right topics but could be expanded with current references.