
Crawlee for Python Tutorial with Examples

8 min read
Oleg Kulyk

Web scraping has become an essential tool for data extraction in various industries, from market analysis to academic research. One of the most effective Python libraries available today is Crawlee, which provides a robust framework for both simple and complex web scraping tasks. Crawlee supports various scraping scenarios, including static web pages parsed with BeautifulSoup and JavaScript-rendered content handled with Playwright. In this tutorial, we will look at how to set up and effectively use Crawlee for Python, providing clear examples and best practices to ensure efficient and scalable web scraping operations. This guide aims to equip you with the knowledge to build your own web scrapers, whether you are just getting started or looking to implement advanced features. For more detailed documentation, you can visit the Crawlee Documentation and the Crawlee PyPI page.

Getting Started with Crawlee for Python

Installation and Setup

To begin using Crawlee for Python, ensure you have Python 3.9 or higher installed on your system. The recommended method for installation is using pip, Python's package installer. Execute the following command to install Crawlee:

pip install crawlee

For users who require additional features, optional extras can be installed:

pip install 'crawlee[all]'

This command installs Crawlee with all available extras, including support for BeautifulSoup and Playwright (Crawlee PyPI).

If you plan to use the PlaywrightCrawler, you also need to install the Playwright browser binaries separately:

playwright install

To verify that Crawlee has been successfully installed, you can run:

python -c "import crawlee; print(crawlee.__version__)"

This command will display the installed version of Crawlee (GitHub - Crawlee Python).

Choosing a Crawler

Crawlee for Python offers two main crawler classes: BeautifulSoupCrawler and PlaywrightCrawler. Both crawlers share the same interface, providing flexibility when switching between them (Crawlee Documentation); a minimal sketch of this shared interface follows the list below.

  1. BeautifulSoupCrawler: This is a plain HTTP crawler that parses HTML using the BeautifulSoup library. It's fast and efficient but cannot handle JavaScript rendering.

  2. PlaywrightCrawler: This crawler uses Playwright to control a headless browser, allowing it to handle JavaScript-rendered content and complex web applications.
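
As a rough sketch of that shared interface (assuming the import paths used in recent Crawlee releases; module names may differ slightly between versions), the following minimal BeautifulSoupCrawler can be turned into a PlaywrightCrawler by changing only the import and the context type:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    # A plain HTTP crawler; a PlaywrightCrawler is constructed and run the same way
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=5)

    # Register the request handler on the crawler's router
    @crawler.router.default_handler
    async def handle_page(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the parsed BeautifulSoup document
        title = context.soup.title.string if context.soup.title else 'no title'
        print(f"{context.request.url}: {title}")

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())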

Creating Your First Web Scraper with Crawlee in Python

Creating your first web scraper with Crawlee is straightforward. Let's walk through the process step by step.

  1. Write the Scraper:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler and limit the number of requests
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    # Define the function to handle each page and register it as the default handler
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()      # Get the title of the page
        content = await context.page.content()  # Get the HTML content of the page
        print(f"Title: {title}")
        print(f"Content length: {len(content)}")

    # Run the crawler with the starting URL
    await crawler.run(["https://example.com"])

if __name__ == "__main__":
    asyncio.run(main())
  2. Explanation:
    • Importing Libraries: We import PlaywrightCrawler and its crawling context from the crawlee package, plus asyncio to run the main coroutine.
    • Handle Page Function: The asynchronous function handle_page processes each page, retrieving and printing the title and content; the @crawler.router.default_handler decorator registers it as the default request handler.
    • Creating the Crawler: We create an instance of PlaywrightCrawler and set a limit on the number of requests.
    • Running the Crawler: Finally, we run the crawler with a starting URL.

Advanced Usage: Crawling Multiple URLs

To expand our crawler to handle multiple starting URLs and implement more advanced features, we can modify our script as follows:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler, limiting requests and concurrency
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
    )

    # Define the function to handle each page
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()  # Get the title of the page
        url = context.request.url           # Get the URL of the page

        # Extract all links from the page
        links = await context.page.evaluate('() => Array.from(document.links).map(link => link.href)')

        # Save data to the default dataset
        await context.push_data({
            'url': url,
            'title': title,
            'links_count': len(links),
        })

        # Enqueue discovered links for crawling
        await context.enqueue_links()

    # Run the crawler with multiple starting URLs
    await crawler.run([
        "https://example.com",
        "https://another-example.com",
    ])

if __name__ == "__main__":
    asyncio.run(main())

This enhanced script showcases:

  1. Handling multiple starting URLs
  2. Extracting and following links from each page
  3. Saving data to a Dataset
  4. Controlling concurrency and request limits
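
The collected records end up in Crawlee's default dataset, which in current releases is written to ./storage/datasets/default as JSON files. If your Crawlee version provides the crawler.export_data() helper, you can also dump everything to a single file after the run finishes, for example:

    await crawler.export_data('results.csv')  # export the default dataset (assumes export_data is available)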

Implementing Custom Logic: Filtering and Processing

To add custom logic for filtering pages and processing data before storing, we can further modify our script:

import asyncio
import re

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler, limiting requests and concurrency
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        concurrency_settings=ConcurrencySettings(max_concurrency=10),
    )

    # Define the function to handle each page
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url           # Get the URL of the page
        title = await context.page.title()  # Get the title of the page

        # Only process pages whose URL matches specific patterns
        if not re.search(r'(product|category)', url):
            print(f"Skipping non-product/category page: {url}")
            return

        # Extract product information
        price = await context.page.evaluate('() => document.querySelector(".price")?.innerText')
        description = await context.page.evaluate('() => document.querySelector(".description")?.innerText')

        # Process and clean data
        clean_price = float(price.replace('$', '').strip()) if price else None
        clean_description = description.strip() if description else None

        # Save processed data to the default dataset
        await context.push_data({
            'url': url,
            'title': title,
            'price': clean_price,
            'description': clean_description,
        })

    # Run the crawler with the starting URL
    await crawler.run(["https://example-shop.com"])

if __name__ == "__main__":
    asyncio.run(main())

This script demonstrates:

  1. URL filtering using regular expressions
  2. Extracting specific elements from the page
  3. Data cleaning and processing
  4. Saving only relevant, processed data to the Dataset

Best Practices and Tips

To ensure efficient and effective crawling with Crawlee for Python, consider the following best practices:

  1. Use Proxy Rotation: For extensive crawling, implement proxy rotation to avoid IP bans. Crawlee supports various proxy configurations (see the sketch after this list).

  2. Implement User-Agents: Rotate user-agents to make your requests appear more natural. Depending on the crawler you use, this can be achieved by customizing the HTTP client's headers or the browser launch options.

  3. Handle Errors Gracefully: Implement try-except blocks in your handle_page function to catch and log errors without stopping the entire crawl.

  4. Respect Robots.txt: Configure your crawler to respect the robots.txt file of the websites you're crawling. Recent Crawlee releases expose an option for this in the crawler configuration; check the documentation for your version.

  5. Optimize Storage: Use Crawlee's built-in storage mechanisms efficiently. The Dataset class allows for easy data storage and export in various formats.

  6. Monitor Performance: Keep an eye on your crawler's performance using Crawlee's built-in logging and monitoring features. Adjust concurrency and request limits as needed.

  7. Stay Updated: Regularly update your Crawlee installation to benefit from the latest features and bug fixes.
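
As a rough illustration of points 1 and 6 above, the sketch below wires a proxy configuration and concurrency limits into a BeautifulSoupCrawler. The proxy URLs are placeholders, and the class names (ProxyConfiguration, ConcurrencySettings) reflect recent Crawlee releases, so treat this as a starting point rather than a definitive configuration:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    # Rotate between several proxies (placeholder URLs) and cap concurrency
    proxy_configuration = ProxyConfiguration(
        proxy_urls=['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000'],
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def handle_page(context: BeautifulSoupCrawlingContext) -> None:
        # Store the URL and title of every page fetched through a proxy
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())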

By following these guidelines and leveraging Crawlee's powerful features, you can build efficient, scalable, and maintainable web scrapers for a wide range of applications, from data analysis to search engine development.

Advanced Usage: Crawling Multiple URLs with Crawlee for Python

Setting Up the Crawler for Multiple URLs

To crawl multiple URLs concurrently with Crawlee, you need to initialize the appropriate crawler class and define a request handler function.

Example: BeautifulSoupCrawler Setup

In this example, we will use BeautifulSoupCrawler to crawl multiple URLs:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Register the request handler that defines your scraping logic
    @crawler.router.default_handler
    async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
        # Define your scraping logic here
        pass

    # Pass all starting URLs to a single run
    await crawler.run(['https://example1.com', 'https://example2.com'])

if __name__ == '__main__':
    asyncio.run(main())

This setup allows you to process multiple URLs concurrently, improving the efficiency of your scraping tasks.

Implementing Request Handlers

The request handler defines how each URL should be processed. Here is a more detailed example:

Example: Detailed Request Handler

# 'crawler' is the BeautifulSoupCrawler instance created in the previous example
@crawler.router.default_handler
async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
    # context.soup is the parsed HTML document
    title = context.soup.select_one('title').text
    paragraphs = [p.text for p in context.soup.select('p')]

    print(f"URL: {context.request.url}")
    print(f"Title: {title}")
    print(f"Number of paragraphs: {len(paragraphs)}")

    # Enqueue more links if needed, e.g. only category pages
    await context.enqueue_links(selector='a[href*="/category/"]')

This handler extracts the page title, counts paragraphs, and enqueues additional links for crawling.

Managing the Request Queue

Crawlee uses a RequestQueue to manage URLs. You can add URLs dynamically during the crawling process.

Example: Adding URLs Dynamically

await crawler.add_requests(['https://example3.com', 'https://example4.com'])

This feature allows for dynamic expansion of your crawl scope.
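
If you need direct access to the underlying queue, for example to seed it before the crawler starts, you can open it explicitly. This is a minimal sketch assuming the RequestQueue class from crawlee.storages and the Request.from_url helper available in recent releases:

from crawlee import Request
from crawlee.storages import RequestQueue

async def seed_queue() -> None:
    # Open the default request queue and add a request to it directly
    request_queue = await RequestQueue.open()
    await request_queue.add_request(Request.from_url('https://example5.com'))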

Handling Different Types of Content

Different websites may require different handling approaches. Crawlee provides various crawler classes for these scenarios.

Example: PlaywrightCrawler for JavaScript-rendered Content

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handle_request(context: PlaywrightCrawlingContext) -> None:
        # Wait for the JavaScript-rendered element before reading it
        await context.page.wait_for_selector('.dynamic-content')
        content = await context.page.inner_text('.dynamic-content')
        print(f"Dynamic content: {content}")

    await crawler.run(['https://js-heavy-site.com'])

if __name__ == '__main__':
    asyncio.run(main())

This flexibility allows efficient crawling of both static and dynamic websites.

Optimizing Performance and Handling Errors

Optimizing performance and handling errors is crucial when crawling multiple URLs.

Techniques for Optimization:

  1. Automatic retries: Crawlee automatically retries failed requests (a combined retry and error-handling sketch follows this list).
  2. Concurrency control: Adjust the number of concurrent requests.
    from crawlee import ConcurrencySettings
    crawler = BeautifulSoupCrawler(concurrency_settings=ConcurrencySettings(max_concurrency=5))
  3. Proxy rotation: Use proxy rotation to avoid IP bans.
    from crawlee.proxy_configuration import ProxyConfiguration
    crawler = BeautifulSoupCrawler(proxy_configuration=ProxyConfiguration(proxy_urls=['http://proxy1.com', 'http://proxy2.com']))
  4. Error handling: Implement try-except blocks in your request handler.
    async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
        try:
            ...  # Your scraping logic here
        except Exception as e:
            print(f"Error processing {context.request.url}: {e}")

These optimizations ensure your crawler remains efficient and resilient.

Conclusion

By leveraging these advanced techniques with Crawlee for Python, you can create powerful web scraping solutions capable of handling complex scenarios and large-scale data extraction tasks. Crawlee's unified interface for HTTP and headless browser crawling, combined with robust error handling and performance optimization features, makes it an excellent choice for building reliable and scalable web crawlers.

For more information, refer to the Crawlee Documentation.
