Stage 4 Research Report: Web Scraping Approaches vs Industry Best Practice
Document under review: _posts/2021-11-09-re-scrape.md
Date: 2026-02-24
A. Current Scraping Tool Landscape
Tier 1: HTTP-level scrapers (no browser rendering)
- Scrapy (Python): The production workhorse, powering an estimated 34% of production scraping projects. Best for high-volume extraction of static/server-rendered content.
- https://scrapy.org/
- Ruby Mechanize / Nokogiri: Still functional for simple static sites, but Mechanize cannot execute JavaScript and its HTTP patterns are easily fingerprinted by modern bot detection. Considered a legacy choice for well-defended targets.
- https://github.com/sparklemotion/mechanize
- Requests + BeautifulSoup (Python): Same tier — HTTP-only, no JS rendering.
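The Tier 1 pattern is simply "fetch HTML, parse, extract" with no JavaScript execution. A stdlib-only Python sketch (parsing a canned page rather than fetching over the network, to keep it self-contained; the HTML and class names are illustrative):

```python
# Minimal Tier 1 extraction: parse server-rendered HTML with no JS execution.
# A canned page stands in for the body of an HTTP response.
from html.parser import HTMLParser

HTML = """<html><body>
<h2 class="title">Post One</h2>
<h2 class="title">Post Two</h2>
</body></html>"""

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

parser = TitleExtractor()
parser.feed(HTML)
print(parser.titles)  # ['Post One', 'Post Two']
```

Requests + BeautifulSoup (or Mechanize + Nokogiri in Ruby) replace the hand-rolled parser with nicer selectors, but the capability ceiling is the same: whatever the server renders into the initial HTML.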
Tier 2: Browser automation (full JS rendering)
- Playwright (Microsoft): Now the consensus recommendation for new projects. Multi-browser support (Chromium, Firefox, WebKit), unified API, auto-waiting. Has overtaken Puppeteer.
- https://playwright.dev/
- Puppeteer (Google): Still relevant for Node.js projects with existing infrastructure, but Chromium-only.
- https://pptr.dev/
- Selenium: Mature but slower and more verbose; mostly found in legacy codebases and cross-browser test suites.
Tier 3: Managed scraping services and APIs
- Scraping APIs: ScraperAPI, ZenRows, ScrapingBee, Scrapfly, Crawlbase. These handle proxy rotation, CAPTCHA solving, and stealth. You send a URL, they return rendered HTML.
- Data-as-a-Service platforms: Apify (pre-built “Actors”), Zyte, Bright Data. Higher-level — buy datasets or run maintained scraper templates.
- Managed browser platforms: Browserbase, Browserless. Remote headless browsers with stealth and proxy management built in.
Tier 4: Passive/intercepting approaches
- mitmproxy: Man-in-the-middle proxy that intercepts and records HTTP/HTTPS traffic. Does not initiate requests — records traffic from a real browser session, sidestepping bot detection entirely because requests come from a genuine browser with genuine user behaviour.
- https://mitmproxy.org/
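A recording addon for mitmproxy can be a plain class with a `response` hook, which mitmproxy calls once per completed HTTP response. A minimal sketch: the hook name and the `flow.request.pretty_url` / `flow.response.text` / `flow.response.status_code` attributes are mitmproxy's addon API; the output file name is illustrative.

```python
# A minimal mitmproxy addon: record each URL and response body while a human
# browses through the proxy.  Run with:  mitmproxy -s record.py
import json

class Recorder:
    def __init__(self, path="capture.jsonl"):
        self.path = path

    def response(self, flow):
        # mitmproxy invokes this hook once per completed HTTP response.
        record = {
            "url": flow.request.pretty_url,
            "status": flow.response.status_code,
            "body": flow.response.text,
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

addons = [Recorder()]
```

Because the traffic originates from a real browser driven by a real person, nothing in this pipeline is visible to bot detection; the addon only observes.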
Mapping the blog post’s approaches
| Blog post approach | Industry tier | Status in 2025-2026 |
|---|---|---|
| Ruby Mechanize (HTTP-level) | Tier 1 | Dead on arrival against Cloudflare/DataDome. Trivially fingerprinted via TLS. |
| Puppeteer/Apify (headless browser) | Tier 2 | Detectable without stealth hardening. Vanilla Puppeteer blocked by CDP side effects, navigator.webdriver, behavioural analysis. |
| mitmproxy (passive recording) | Tier 4 | Correct — avoids all detection. Tradeoff: no automation, requires human browsing. |
| (Not covered) Managed services | Tier 3 | Where most production scraping of well-defended sites happens. Proxy rotation, CAPTCHA solving, stealth at scale. |
B. Anti-Scraping Defences
Rate limiting and IP reputation
HTTP 429 (Too Many Requests) is the first line of defence. Standard mitigation: exponential backoff with jitter, respecting Retry-After headers, proxy rotation across residential/mobile IP pools. IP reputation databases track datacenter IPs, known proxy ranges, and ASN classifications.
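The standard retry policy can be captured in one pure function, a sketch assuming "full jitter" backoff (the base and cap values are illustrative defaults):

```python
# Exponential backoff with full jitter, honouring a Retry-After header when
# the server sends one.  Pure function, so the policy is easy to test.
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server said exactly how long to wait; respect it.
        return float(retry_after)
    # Full jitter: uniform between 0 and the capped exponential bound.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller sleeps for `backoff_delay(n, retry_after)` after each 429 before retrying; if blocks persist across retries, that is usually an IP-reputation problem, not a rate problem, and the answer is proxy rotation rather than longer waits.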
TLS fingerprinting
Servers analyse the TLS Client Hello message (cipher suites, extensions, and their ordering) to identify the client library. Scrapy, Requests, and Node.js's `http`/`https` modules produce TLS fingerprints that differ from a real browser's. Tools like curl-impersonate or tls-client exist to mimic browser TLS fingerprints.
- https://www.browserless.io/blog/tls-fingerprinting-explanation-detection-and-bypassing-it-in-playwright-and-puppeteer
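One input to fingerprints like JA3 is the ordered cipher-suite list the client offers, and you can inspect your own client's list locally. A sketch using Python's stdlib `ssl` module: the list a default Python context offers is not the list Chrome or Firefox sends, which is exactly the discrepancy servers key on.

```python
# Inspect the cipher suites a default Python TLS context would offer.
# This ordered list (plus extensions) is what TLS fingerprinting hashes.
import ssl

ctx = ssl.create_default_context()
suites = [c["name"] for c in ctx.get_ciphers()]
print(len(suites), suites[:3])
```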
JavaScript challenges and browser fingerprinting
- Cloudflare Turnstile: Cryptographic challenges verifying browser environment integrity.
- https://blog.cloudflare.com/per-customer-bot-defenses/
- DataDome: Deep device fingerprinting — maps declared specs against actual hardware (e.g. a User-Agent claiming Android but reporting 64 CPU cores gets blocked). Tracks mouse movements, scroll patterns, typing cadence.
- Detection vectors: `navigator.webdriver` flag, `HeadlessChrome` in the User-Agent, missing browser plugins, Canvas/WebGL rendering differences, Chrome DevTools Protocol (CDP) side effects, Playwright-injected globals.
Cloudflare AI Labyrinth (March 2025)
When bot activity is detected, invisible links are injected into pages leading to AI-generated decoy content. Any visitor following these links 4+ levels deep is almost certainly a bot. Available free to all Cloudflare customers.
- https://blog.cloudflare.com/ai-labyrinth/
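The crawler-side defence against decoy mazes is simply to bound traversal depth from each seed page. A sketch over an in-memory link graph (the graph format and depth cap are illustrative; a real crawler would also whitelist URL patterns):

```python
# AI Labyrinth catches crawlers that blindly follow injected links several
# levels deep.  A cautious crawler caps how far it wanders from a seed page.
from collections import deque

def crawl(start, links, max_depth=3):
    """Breadth-first walk of `links` (url -> list of urls), depth-capped."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # never follow links 4+ levels from a seed
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With `max_depth=3`, a decoy chain injected below a real page is abandoned before the "4+ levels deep" threshold that marks a visitor as a bot.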
Headless browser detection — current state
The arms race is tilted toward detection in 2025-2026:
Detectable signals:
- `navigator.webdriver` flag (trivially patchable, but its absence is itself detectable)
- CDP instrumentation traces in the browser’s JavaScript environment
- Playwright-specific globals (`window.__playwright__binding__`)
- Missing browser plugins/extensions that real users typically have
- Canvas and WebGL rendering differences in headless mode
Evasion tools:
- `puppeteer-extra-plugin-stealth`: Patches common detection vectors. Increasingly insufficient against DataDome and Cloudflare.
- `fingerprint-suite`: Generates and injects realistic browser fingerprints.
- `nodriver` (Python): Successor to `undetected_chromedriver`; avoids CDP-based detection.
Bottom line: vanilla Puppeteer or Playwright against Cloudflare or DataDome will be detected and blocked. The blog post’s experience is completely typical.
- https://blog.castle.io/how-to-detect-headless-chrome-bots-instrumented-with-playwright/
- https://blog.castle.io/from-puppeteer-stealth-to-nodriver-how-anti-detect-frameworks-evolved-to-evade-bot-detection/
C. Legal and Ethical Frameworks
Key court precedents
- hiQ Labs v. LinkedIn (2022): Ninth Circuit ruled that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). Landmark case for scraping legality.
- Meta v. Bright Data (2023-2024): Scraping public profiles is allowed, but scraping content behind login walls or contractual restrictions may constitute breach.
Regulatory frameworks
- GDPR (EU): Scraping personal data without a proper legal basis can result in fines of up to €20 million or 4% of global annual turnover, whichever is higher. The French CNIL clarified in 2025 that even “public” web pages may contain personal data requiring GDPR safeguards.
- CCPA (California): Similar obligations for personal information of California residents.
- EU AI Act: Scraping for AI training purposes has additional compliance requirements.
robots.txt and Terms of Service
- `robots.txt` is not legally binding, but ignoring it demonstrates bad faith and weakens your legal position. Courts have cited robots.txt violations as evidence of intent to circumvent access controls.
- ToS violations alone may not create criminal liability, but they can support civil claims (breach of contract, trespass to chattels).
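Checking robots.txt before fetching costs almost nothing; Python's standard library parses it. A self-contained sketch (the rules and bot name are made up; a real scraper would point `set_url` at the live file):

```python
# Consult robots.txt rules with the stdlib parser.  The file content is
# supplied inline here so the example needs no network access.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # False
print(rp.can_fetch("MyBot/1.0", "https://example.com/public/y"))   # True
```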
Ethical consensus principles
- Respect `robots.txt` and rate limits
- Identify your scraper with a real User-Agent string
- Scrape only what you need — minimise server load
- Do not scrape personal data without legal basis
- Do not circumvent authentication or paywalls
- Be transparent about purpose and methods
- Cache aggressively to avoid redundant requests
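The last principle is mechanical to implement: key responses by a hash of the URL so re-runs never re-fetch pages they already hold. A minimal on-disk sketch (the cache directory layout and the injected `fetch` callable are illustrative):

```python
# "Cache aggressively": never hit the same URL twice across scraper runs.
import hashlib
import os

def cached_fetch(url, fetch, cache_dir="cache"):
    """Return the cached body for `url`, calling `fetch(url)` only on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    return body
```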
The mitmproxy approach and ethics
Passive recording of your own browsing session is ethically distinct from automated scraping. Requests originate from a real browser at human speed. The ethical consideration shifts from “am I overwhelming their servers” to “am I using the captured data in ways the site’s ToS permits.”
Sources:
- https://www.grepsr.com/blog/overview-web-scraping-legality/
- https://www.xbyte.io/the-future-of-web-scraping-compliance-navigating-gdpr-ccpa-and-ai-laws-in-2025/
- https://iswebscrapinglegal.com/blog/gdpr-ccpa-web-scraping/
D. Legitimate Alternatives to Scraping
- Official APIs: Always check for an API first — most reliable and legally clean.
- RSS/Atom feeds: Still available on many content sites.
- Data providers: Bright Data, Oxylabs, Zyte sell pre-collected datasets.
- Browser extensions: Instant Data Scraper, Data Miner for small-scale needs.
- Sitemaps: `sitemap.xml` provides a structured index of all pages.
- Direct data licensing: For commercial use at scale; eliminates legal risk entirely.
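Reading a sitemap is often enough to enumerate a site without crawling it at all. A stdlib sketch, parsing an inline sample (real sitemaps use the same `sitemaps.org` namespace; the URLs are placeholders):

```python
# Enumerate page URLs from a sitemap.xml document.
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/', 'https://example.com/about']
```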
Summary
The blog post’s escalation path (HTTP client → headless browser → passive proxy) accurately reflects the reality that well-defended sites can defeat both HTTP clients and headless browsers. The post is honest about what doesn’t work and why the passive approach is the only viable option without significant infrastructure investment. The main gap is the missing Tier 3 (managed scraping services), which is where most production scraping of well-defended sites actually happens. The legal and ethical landscape has also evolved significantly since 2021, with key court rulings and regulatory enforcement providing clearer frameworks. The post’s disclaimer section covers the right topics but could be expanded with current references.