Beginner: I'm new to scraping and being blocked

berstend̡̲̫̹̠̖͚͓̔̄̓̐̄͛̀͘ edited this page Apr 26, 2021 · 7 revisions

Problem: Your scraper is being blocked

This wiki page aims to be a beginner-friendly entry point for understanding why this happens and how to mitigate it.

Note: This document is only relevant if you are running into issues. If your custom shell script loop using curl runs fine, that's great.

Most common issues

You're using a non-browser-based scraper (curl, requests, scrapy, etc.)

The days when this was sufficient are long gone 😄
It's easy for a site to use JS to gather or calculate some data and then require that data in its backend (sent as cookies, headers, or POST data)
In addition, most sites are built with dynamic JS nowadays, so static HTML scraping won't get you far
Solution: Switch to a scraping framework which uses a real browser (puppeteer, playwright)
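
The point about JS-calculated data can be illustrated with a toy model in Python. Everything in it is hypothetical: the `js_token` cookie name, the SHA-256 scheme, and both helper functions are made up for illustration, and real sites use their own (usually obfuscated) logic. It only shows why a client that never executes the page's JS ends up without the value the backend expects.

```python
import hashlib


def compute_token(challenge: str, user_agent: str) -> str:
    """Stand-in for what the page's JS would calculate in a real browser."""
    return hashlib.sha256(f"{challenge}:{user_agent}".encode()).hexdigest()


def backend_accepts(challenge: str, user_agent: str, cookies: dict) -> bool:
    """Hypothetical backend check: reject requests lacking the JS-derived cookie."""
    token = cookies.get("js_token")
    return token is not None and token == compute_token(challenge, user_agent)


challenge = "abc123"  # assumed to be embedded in the HTML the site serves
ua = "Mozilla/5.0 (X11; Linux x86_64)"

# A curl-style client downloads the HTML but never runs the JS, so no cookie is set.
plain_client_cookies = {}

# A real browser runs the JS, which stores the calculated value in a cookie.
browser_cookies = {"js_token": compute_token(challenge, ua)}

print(backend_accepts(challenge, ua, plain_client_cookies))
print(backend_accepts(challenge, ua, browser_cookies))
```

In this sketch the plain client is rejected and the browser is accepted. Copying the cookie by hand would work only until the site changes its challenge or its calculation, which is why driving a real browser tends to be the more robust route.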

You're using Selenium

Selenium is the grandfather of browser-based scraping frameworks and leaks its presence in too many ways
This applies to anything that is not a real browser as well: Scrapy's Splash, PhantomJS, Electron, CasperJS, etc.
Solution: Don't use Selenium, use puppeteer or playwright

You're using puppeteer without stealth

By default, a site can detect the use of puppeteer (in both headless and headful mode)
Solution: Use puppeteer-extra-plugin-stealth

You're using nonsensical data

Don't try to emulate another browser engine or device type (e.g. mobile) when using a desktop browser
Don't use data that doesn't make sense (e.g. macOS platform with an Nvidia RTX 3080 GPU)
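
A quick sanity check on the values you plan to present can catch such contradictions before a site does. The Python sketch below is only an illustration: the profile keys (`user_agent`, `platform`, `gpu`) and the three rules are assumptions chosen to mirror the examples above, not a list of what any particular site checks.

```python
def find_inconsistencies(profile: dict) -> list:
    """Return human-readable problems found in a hypothetical fingerprint profile."""
    problems = []
    ua = profile.get("user_agent", "")
    platform = profile.get("platform", "")  # e.g. navigator.platform: "Win32", "MacIntel"
    gpu = profile.get("gpu", "")

    # A mobile user agent next to a desktop platform value does not add up.
    if ("Android" in ua or "iPhone" in ua) and platform in ("Win32", "MacIntel"):
        problems.append("mobile user agent on a desktop platform")

    # The example from the text: macOS platform with an Nvidia RTX GPU.
    if platform == "MacIntel" and "RTX" in gpu:
        problems.append("macOS platform with an Nvidia RTX GPU")

    # The operating system named in the user agent should match the platform.
    if "Windows NT" in ua and platform != "Win32":
        problems.append("Windows user agent with a non-Windows platform")

    return problems


mac_with_rtx = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "platform": "MacIntel",
    "gpu": "NVIDIA GeForce RTX 3080",
}
print(find_inconsistencies(mac_with_rtx))
```

The simplest way to pass checks like these is to not spoof at all and let the real browser report its real values.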

Your IP address is bad

Don't use free proxies from the internet; they are easily detected as such
Don't use Tor: all exit nodes are publicly listed, and the network is meant for people in need
Don't use your home internet connection too often, or you might experience rate limiting or bans
Don't use datacenter IPs or proxies; they can be detected as not being "residential"
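
If you do scrape from a single connection, pacing your requests is one way to lower the odds of hitting a rate limit. The Python sketch below is a made-up policy, not a recommendation from any site: the function name, the 2 second base, and the 120 second cap are all assumptions. It doubles the wait after every consecutive blocked response (for example an HTTP 429) and adds random jitter so requests are not evenly spaced.

```python
import random


def polite_delay(consecutive_blocks: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Seconds to wait before the next request (hypothetical policy).

    The upper bound doubles with each consecutive blocked response and is
    capped at `cap`; the result is drawn at random from the upper half of
    that range.
    """
    ceiling = min(cap, base * 2 ** consecutive_blocks)
    return random.uniform(ceiling / 2, ceiling)


# Example: print a possible wait time after 0 to 3 blocked responses in a row.
for blocks in range(4):
    print(blocks, round(polite_delay(blocks), 1))
```

Pass the result to `time.sleep()` between requests. This only spreads the load; it does not make a bad IP address look any better.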

How bot detection works

(TODO: Add more content here)