
Stop Getting Blocked! Fix These 5 Python Web Scraping Mistakes

Oleg Kulyk · 4 min read


Web scraping is an essential skill for data collection, but getting blocked can be frustrating. In this guide, we'll explore the five most common mistakes that expose your scrapers and learn how to fix them.


1. Using Default User Agents

The most obvious red flag is sending requests with the default user agent of Python's requests library. Let's see the difference:

# Bad Practice ❌
import requests
response = requests.get('https://example.com')

# Good Practice ✅
from fake_useragent import UserAgent
ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://example.com', headers=headers)

Default user agents typically look like this:

python-requests/2.28.1

This immediately identifies your scraper as automated. Instead, a proper user agent looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
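If you'd rather not pull in a third-party library like fake_useragent, a minimal alternative sketch is to rotate a small, hand-maintained pool of real browser strings (the entries below are illustrative examples and should be refreshed as browsers update):

import random
import requests

# Illustrative pool -- keep these strings current with real browser releases
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://example.com', headers=headers)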

2. Incomplete Headers

Many developers only set the user agent, but real browsers send numerous headers. Here's how to properly set headers:

headers = {
    'User-Agent': UserAgent().random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0'
}

These headers make your requests appear more legitimate by mimicking real browser behavior.
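To sanity-check what you are actually sending, one quick option (assuming outbound access to httpbin.org and reusing the headers dict above) is to have the request echoed back and compare it with what a real browser sends:

import requests

# Echoes the headers the server received, so you can spot anything missing or off
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])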

3. No Rate Limiting

Sending requests too quickly is a common mistake. Implement rate limiting to avoid detection:

import time
import random

def scrape_with_delay(urls):
    session = requests.Session()
    results = []

    for url in urls:
        # Random delay between 1-3 seconds
        time.sleep(random.uniform(1, 3))
        response = session.get(url, headers=headers)
        results.append(response.text)

    return results

Consider these rate limiting best practices:

  • Add random delays between requests
  • Implement exponential backoff for errors
  • Respect robots.txt guidelines (see the sketch after this list)
  • Use different delays for different websites
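For the robots.txt point, the standard library already ships a parser; here is a minimal sketch (the URLs are placeholders and headers is the dict from section 2):

from urllib import robotparser

import requests

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Only fetch paths the site allows for our user agent
target = 'https://example.com/some/page'
if rp.can_fetch(headers['User-Agent'], target):
    response = requests.get(target, headers=headers)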

4. Not Using Sessions

Creating new connections for each request is inefficient and suspicious. Use sessions instead:

# Bad Practice ❌
for url in urls:
    response = requests.get(url)

# Good Practice ✅
with requests.Session() as session:
    for url in urls:
        response = session.get(url)

Sessions provide:

  • Connection pooling
  • Cookie persistence
  • Better performance
  • More natural browsing patterns
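Sessions also combine naturally with the headers from section 2: set them once on the session (a minimal sketch reusing the headers dict defined earlier) and they are sent with every request:

with requests.Session() as session:
    session.headers.update(headers)  # applied to every request made through this session
    for url in urls:
        response = session.get(url)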

5. Unmasked Automation Tools

When using tools like Playwright or Selenium, proper configuration is crucial. Here's how to mask your automation:

import asyncio
from playwright.async_api import async_playwright

# Wrapper function (name is illustrative); run it with asyncio.run(launch_masked_browser())
async def launch_masked_browser():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
            viewport={'width': 1920, 'height': 1080},
            device_scale_factor=1,
            locale='en-US',
            timezone_id='America/New_York',
            geolocation={'latitude': 40.730610, 'longitude': -73.935242},
            permissions=['geolocation']
        )

        page = await context.new_page()

        # Hide automation indicators
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

Additional Tips for Avoiding Blocks

  • Rotate IP Addresses (see the proxy sketch after the error-handling example below)

    • Use proxy services
    • Implement IP rotation middleware
    • Consider residential proxies for sensitive sites
  • Handle Cookies Properly

    • Store and reuse cookies
    • Maintain session state
    • Clear cookies periodically
  • Monitor Response Patterns

    • Track response codes
    • Implement retry logic
    • Watch for CAPTCHAs
  • Use Error Handling

def resilient_request(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
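For the IP rotation tip above, requests accepts a per-request proxies mapping; here is a minimal sketch (the proxy URLs and the get_via_proxy helper are placeholders for your own provider and code, and headers is the dict from section 2):

import random
import requests

# Placeholder endpoints -- substitute your proxy provider's hosts and credentials
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def get_via_proxy(url):
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)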

Conclusion

Avoiding blocks is about making your scraper behave more like a human user. Remember:

  • Use realistic user agents and headers
  • Implement proper rate limiting
  • Maintain sessions
  • Mask automation indicators
  • Handle errors gracefully

By fixing these common mistakes, you'll create more reliable and sustainable web scraping solutions.

Resources

Forget about getting blocked while scraping the Web

Try out ScrapingAnt Web Scraping API with thousands of proxy servers and an entire headless Chrome cluster