
Stop Getting Blocked! Fix These 5 Python Web Scraping Mistakes

Oleg Kulyk · 4 min read


Web scraping is an essential skill for data collection, but getting blocked can be frustrating. In this guide, we'll explore the five most common mistakes that expose your scrapers and learn how to fix them.


1. Using Default User Agents

The most obvious red flag is sending requests with the default user agent of Python's requests library. Let's see the difference:

# Bad Practice ❌
import requests
response = requests.get('https://example.com')

# Good Practice ✅
from fake_useragent import UserAgent
ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://example.com', headers=headers)

Default user agents typically look like this:

python-requests/2.28.1

This immediately identifies your scraper as automated. Instead, a proper user agent looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
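If you'd rather not pull in a third-party library like fake_useragent, a minimal alternative sketch is to rotate a small, hand-maintained pool of real browser strings (the entries below are illustrative examples and should be refreshed as browsers update):

import random
import requests

# Illustrative pool -- keep these strings current with real browser releases
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://example.com', headers=headers)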

2. Incomplete Headers

Many developers only set the user agent, but real browsers send numerous headers. Here's how to properly set headers:

headers = {
    'User-Agent': UserAgent().random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0'
}

These headers make your requests appear more legitimate by mimicking real browser behavior.
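To sanity-check what you are actually sending, one quick option (assuming outbound access to httpbin.org and reusing the headers dict above) is to have the request echoed back and compare it with what a real browser sends:

import requests

# Echoes the headers the server received, so you can spot anything missing or off
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])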

3. No Rate Limiting

Sending requests too quickly is a common mistake. Implement rate limiting to avoid detection:

import time
import random

def scrape_with_delay(urls):
    session = requests.Session()
    results = []

    for url in urls:
        # Random delay between 1-3 seconds
        time.sleep(random.uniform(1, 3))
        response = session.get(url, headers=headers)
        results.append(response.text)

    return results

Consider these rate limiting best practices:

  • Add random delays between requests
  • Implement exponential backoff for errors
  • Respect robots.txt guidelines (see the sketch after this list)
  • Use different delays for different websites
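For the robots.txt point, the standard library already ships a parser; here is a minimal sketch (the URLs are placeholders and headers is the dict from section 2):

from urllib import robotparser

import requests

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Only fetch paths the site allows for our user agent
target = 'https://example.com/some/page'
if rp.can_fetch(headers['User-Agent'], target):
    response = requests.get(target, headers=headers)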

4. Not Using Sessions

Creating new connections for each request is inefficient and suspicious. Use sessions instead:

# Bad Practice ❌
for url in urls:
    response = requests.get(url)

# Good Practice ✅
with requests.Session() as session:
    for url in urls:
        response = session.get(url)

Sessions provide:

  • Connection pooling
  • Cookie persistence
  • Better performance
  • More natural browsing patterns
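Sessions also combine naturally with the headers from section 2: set them once on the session (a minimal sketch reusing the headers dict defined earlier) and they are sent with every request:

with requests.Session() as session:
    session.headers.update(headers)  # applied to every request made through this session
    for url in urls:
        response = session.get(url)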

5. Unmasked Automation Tools

When using tools like Playwright or Selenium, proper configuration is crucial. Here's how to mask your automation:

import asyncio
from playwright.async_api import async_playwright

# Wrapper function (name is illustrative); run it with asyncio.run(launch_masked_browser())
async def launch_masked_browser():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
            viewport={'width': 1920, 'height': 1080},
            device_scale_factor=1,
            locale='en-US',
            timezone_id='America/New_York',
            geolocation={'latitude': 40.730610, 'longitude': -73.935242},
            permissions=['geolocation']
        )

        page = await context.new_page()

        # Hide automation indicators
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

Additional Tips for Avoiding Blocks

  • Rotate IP Addresses (see the proxy sketch after the error-handling example below)

    • Use proxy services
    • Implement IP rotation middleware
    • Consider residential proxies for sensitive sites
  • Handle Cookies Properly

    • Store and reuse cookies
    • Maintain session state
    • Clear cookies periodically
  • Monitor Response Patterns

    • Track response codes
    • Implement retry logic
    • Watch for CAPTCHAs
  • Use Error Handling

def resilient_request(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
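For the IP rotation tip above, requests accepts a per-request proxies mapping; here is a minimal sketch (the proxy URLs and the get_via_proxy helper are placeholders for your own provider and code, and headers is the dict from section 2):

import random
import requests

# Placeholder endpoints -- substitute your proxy provider's hosts and credentials
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def get_via_proxy(url):
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)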

Conclusion

Avoiding blocks is about making your scraper behave more like a human user. Remember:

  • Use realistic user agents and headers
  • Implement proper rate limiting
  • Maintain sessions
  • Mask automation indicators
  • Handle errors gracefully

By fixing these common mistakes, you'll create more reliable and sustainable web scraping solutions.

Resources

Forget about getting blocked while scraping the Web

Try out ScrapingAnt Web Scraping API with thousands of proxy servers and an entire headless Chrome cluster