Web scraping is an essential skill for data collection, but getting blocked can be frustrating. In this guide, we'll explore the five most common mistakes that expose your scrapers and learn how to fix them.
1. Using Default User Agents
The most obvious red flag is sending the requests library's default user agent. Let's see the difference:
# Bad Practice ❌
import requests

response = requests.get('https://example.com')

# Good Practice ✅
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://example.com', headers=headers)
Default user agents typically look like this:
python-requests/2.28.1
This immediately identifies your scraper as automated. Instead, a proper user agent looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
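If you'd rather not add a dependency, a hand-maintained pool of real browser strings works too. A minimal sketch; the strings and pool size below are illustrative and should be refreshed periodically:

import random
import requests

# Illustrative pool of real browser user agents
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://example.com', headers=headers)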
2. Incomplete Headers
Many developers only set the user agent, but real browsers send numerous headers. Here's how to properly set headers:
headers = {
    'User-Agent': UserAgent().random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0'
}
These headers make your requests appear more legitimate by mimicking real browser behavior.
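A convenient way to send these headers consistently is to set them once on a session so every request carries them. A minimal sketch, assuming the headers dict defined above:

import requests

session = requests.Session()
session.headers.update(headers)  # headers dict from above

# Every request made through this session now sends the full header set
response = session.get('https://example.com')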
3. No Rate Limiting
Sending requests too quickly is a common mistake. Implement rate limiting to avoid detection:
import time
import random
import requests

def scrape_with_delay(urls):
    session = requests.Session()
    results = []
    for url in urls:
        # Random delay between 1-3 seconds
        time.sleep(random.uniform(1, 3))
        response = session.get(url, headers=headers)
        results.append(response.text)
    return results
Consider these rate limiting best practices:
- Add random delays between requests
- Implement exponential backoff for errors
- Respect robots.txt guidelines (see the sketch after this list)
- Use different delays for different websites
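Checking robots.txt before you crawl takes only the standard library. A minimal sketch using urllib.robotparser; the bot name and URLs are placeholders, and headers is the dict from section 2:

import requests
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Only fetch the page if the rules allow our user agent (name is illustrative)
if rp.can_fetch('MyScraperBot', 'https://example.com/some/page'):
    response = requests.get('https://example.com/some/page', headers=headers)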
4. Not Using Sessions
Creating new connections for each request is inefficient and suspicious. Use sessions instead:
# Bad Practice ❌
for url in urls:
    response = requests.get(url)

# Good Practice ✅
with requests.Session() as session:
    for url in urls:
        response = session.get(url)
Sessions provide:
- Connection pooling
- Cookie persistence
- Better performance
- More natural browsing patterns
5. Unmasked Automation Tools
When using tools like Playwright or Selenium, proper configuration is crucial. Here's how to mask your automation:
import asyncio
from playwright.async_api import async_playwright

async def scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
            viewport={'width': 1920, 'height': 1080},
            device_scale_factor=1,
            locale='en-US',
            timezone_id='America/New_York',
            geolocation={'latitude': 40.730610, 'longitude': -73.935242},
            permissions=['geolocation']
        )
        page = await context.new_page()
        # Hide automation indicators before any page script runs
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html
Additional Tips for Avoiding Blocks
Rotate IP Addresses
- Use proxy services
- Implement IP rotation middleware
- Consider residential proxies for sensitive sites
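With requests, basic rotation can be as simple as cycling through a proxy list per request. A minimal sketch; the proxy addresses are placeholders for whatever service you use, and headers is the dict from earlier:

import itertools
import requests

# Placeholder proxy endpoints; substitute your provider's addresses
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
proxy_cycle = itertools.cycle(PROXIES)

def get_with_proxy(url):
    proxy = next(proxy_cycle)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)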
Handle Cookies Properly
- Store and reuse cookies
- Maintain session state
- Clear cookies periodically
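One simple way to store and reuse cookies is to pickle the session's cookie jar between runs. A minimal sketch; the file name is arbitrary:

import pickle
import requests

session = requests.Session()
session.get('https://example.com')  # the site sets cookies on this session

# Persist cookies to disk at the end of a run
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Later: restore them into a fresh session
new_session = requests.Session()
with open('cookies.pkl', 'rb') as f:
    new_session.cookies.update(pickle.load(f))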
Monitor Response Patterns
- Track response codes
- Implement retry logic
- Watch for CAPTCHAs
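A lightweight way to monitor responses is to check status codes and body content on every fetch and back off when the site starts pushing back. A sketch, assuming the session and headers from earlier; the block indicators (429/403 and a 'captcha' marker) and the backoff duration are illustrative:

import time

def looks_blocked(response):
    # 429 = rate limited; 403 is often an outright block
    if response.status_code in (429, 403):
        return True
    # Crude CAPTCHA check; adjust the marker to the target site
    return 'captcha' in response.text.lower()

response = session.get('https://example.com', headers=headers)
if looks_blocked(response):
    time.sleep(60)  # back off before retrying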
Use Error Handling
def resilient_request(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
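In practice, every fetch in the crawl loop goes through this helper, for example:

response = resilient_request('https://example.com')

On the final failed attempt the exception propagates, so the caller can log the URL and move on rather than crashing the whole crawl.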
Conclusion
Avoiding blocks is about making your scraper behave more like a human user. Remember:
- Use realistic user agents and headers
- Implement proper rate limiting
- Maintain sessions
- Mask automation indicators
- Handle errors gracefully
By fixing these common mistakes, you'll create more reliable and sustainable web scraping solutions.