
Crawlee for Python Tutorial with Examples

8 min read
Oleg Kulyk

Web scraping has become an essential tool for data extraction in various industries, from market analysis to academic research. One of the most effective Python libraries available today is Crawlee, which provides a robust framework for both simple and complex web scraping tasks. Crawlee supports various scraping scenarios, including static web pages parsed with BeautifulSoup and JavaScript-rendered content handled with Playwright. In this tutorial, we will look at how to set up and effectively use Crawlee for Python, providing clear examples and best practices to ensure efficient and scalable web scraping operations. This guide aims to equip you with the knowledge to build your own web scrapers, whether you are just getting started or looking to implement advanced features. For more detailed documentation, you can visit the Crawlee Documentation and the Crawlee PyPI page.

Getting Started with Crawlee for Python

Installation and Setup

To begin using Crawlee for Python, ensure you have Python 3.9 or higher installed on your system. The recommended method for installation is using pip, Python's package installer. Execute the following command to install Crawlee:

pip install crawlee

For users who require additional features, optional extras can be installed:

pip install 'crawlee[all]'

This command installs Crawlee with all available extras, including support for BeautifulSoup and Playwright (Crawlee PyPI).

If you plan to use the PlaywrightCrawler, you also need to install the Playwright browser binaries separately:

playwright install

To verify that Crawlee has been successfully installed, you can run:

python -c "import crawlee; print(crawlee.__version__)"

This command will display the installed version of Crawlee (GitHub - Crawlee Python).

Choosing a Crawler

Crawlee for Python offers two main crawler classes: BeautifulSoupCrawler and PlaywrightCrawler. Both crawlers share the same interface, providing flexibility when switching between them (Crawlee Documentation); a minimal sketch of this shared interface follows the list below.

  1. BeautifulSoupCrawler: This is a plain HTTP crawler that parses HTML using the BeautifulSoup library. It's fast and efficient but cannot handle JavaScript rendering.

  2. PlaywrightCrawler: This crawler uses Playwright to control a headless browser, allowing it to handle JavaScript-rendered content and complex web applications.
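
As a rough sketch of that shared interface (assuming the import paths used in recent Crawlee releases; module names may differ slightly between versions), the following minimal BeautifulSoupCrawler can be turned into a PlaywrightCrawler by changing only the import and the context type:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    # A plain HTTP crawler; a PlaywrightCrawler is constructed and run the same way
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=5)

    # Register the request handler on the crawler's router
    @crawler.router.default_handler
    async def handle_page(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the parsed BeautifulSoup document
        title = context.soup.title.string if context.soup.title else 'no title'
        print(f"{context.request.url}: {title}")

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())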

Creating Your First Web Scraper with Crawlee in Python

Creating your first web scraper with Crawlee is straightforward. Let's walk through the process step by step.

  1. Write the Scraper:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler and limit the number of requests
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    # Define the function to handle each page and register it as the default handler
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()      # Get the title of the page
        content = await context.page.content()  # Get the HTML content of the page
        print(f"Title: {title}")
        print(f"Content length: {len(content)}")

    # Run the crawler with the starting URL
    await crawler.run(["https://example.com"])

if __name__ == "__main__":
    asyncio.run(main())
  2. Explanation:
    • Importing Libraries: We import PlaywrightCrawler and its crawling context from the crawlee package, plus asyncio to run the main coroutine.
    • Handle Page Function: The asynchronous function handle_page processes each page, retrieving and printing the title and content; the @crawler.router.default_handler decorator registers it as the default request handler.
    • Creating the Crawler: We create an instance of PlaywrightCrawler and set a limit on the number of requests.
    • Running the Crawler: Finally, we run the crawler with a starting URL.

Advanced Usage: Crawling Multiple URLs

To expand our crawler to handle multiple starting URLs and implement more advanced features, we can modify our script as follows:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler, limiting requests and concurrency
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
    )

    # Define the function to handle each page
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()  # Get the title of the page
        url = context.request.url           # Get the URL of the page

        # Extract all links from the page
        links = await context.page.evaluate('() => Array.from(document.links).map(link => link.href)')

        # Save data to the default dataset
        await context.push_data({
            'url': url,
            'title': title,
            'links_count': len(links),
        })

        # Enqueue discovered links for crawling
        await context.enqueue_links()

    # Run the crawler with multiple starting URLs
    await crawler.run([
        "https://example.com",
        "https://another-example.com",
    ])

if __name__ == "__main__":
    asyncio.run(main())

This enhanced script showcases:

  1. Handling multiple starting URLs
  2. Extracting and following links from each page
  3. Saving data to a Dataset
  4. Controlling concurrency and request limits
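
The collected records end up in Crawlee's default dataset, which in current releases is written to ./storage/datasets/default as JSON files. If your Crawlee version provides the crawler.export_data() helper, you can also dump everything to a single file after the run finishes, for example:

    await crawler.export_data('results.csv')  # export the default dataset (assumes export_data is available)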

Implementing Custom Logic: Filtering and Processing

To add custom logic for filtering pages and processing data before storing, we can further modify our script:

import asyncio
import re

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create an instance of PlaywrightCrawler, limiting requests and concurrency
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        concurrency_settings=ConcurrencySettings(max_concurrency=10),
    )

    # Define the function to handle each page
    @crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url           # Get the URL of the page
        title = await context.page.title()  # Get the title of the page

        # Only process pages whose URL matches specific patterns
        if not re.search(r'(product|category)', url):
            print(f"Skipping non-product/category page: {url}")
            return

        # Extract product information
        price = await context.page.evaluate('() => document.querySelector(".price")?.innerText')
        description = await context.page.evaluate('() => document.querySelector(".description")?.innerText')

        # Process and clean data
        clean_price = float(price.replace('$', '').strip()) if price else None
        clean_description = description.strip() if description else None

        # Save processed data to the default dataset
        await context.push_data({
            'url': url,
            'title': title,
            'price': clean_price,
            'description': clean_description,
        })

    # Run the crawler with the starting URL
    await crawler.run(["https://example-shop.com"])

if __name__ == "__main__":
    asyncio.run(main())

This script demonstrates:

  1. URL filtering using regular expressions
  2. Extracting specific elements from the page
  3. Data cleaning and processing
  4. Saving only relevant, processed data to the Dataset

Best Practices and Tips

To ensure efficient and effective crawling with Crawlee for Python, consider the following best practices:

  1. Use Proxy Rotation: For extensive crawling, implement proxy rotation to avoid IP bans. Crawlee supports various proxy configurations (see the sketch after this list).

  2. Implement User-Agents: Rotate user-agents to make your requests appear more natural. Depending on the crawler you use, this can be achieved by customizing the HTTP client's headers or the browser launch options.

  3. Handle Errors Gracefully: Implement try-except blocks in your handle_page function to catch and log errors without stopping the entire crawl.

  4. Respect Robots.txt: Configure your crawler to respect the robots.txt file of the websites you're crawling. Recent Crawlee releases expose an option for this in the crawler configuration; check the documentation for your version.

  5. Optimize Storage: Use Crawlee's built-in storage mechanisms efficiently. The Dataset class allows for easy data storage and export in various formats.

  6. Monitor Performance: Keep an eye on your crawler's performance using Crawlee's built-in logging and monitoring features. Adjust concurrency and request limits as needed.

  7. Stay Updated: Regularly update your Crawlee installation to benefit from the latest features and bug fixes.
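
As a rough illustration of points 1 and 6 above, the sketch below wires a proxy configuration and concurrency limits into a BeautifulSoupCrawler. The proxy URLs are placeholders, and the class names (ProxyConfiguration, ConcurrencySettings) reflect recent Crawlee releases, so treat this as a starting point rather than a definitive configuration:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    # Rotate between several proxies (placeholder URLs) and cap concurrency
    proxy_configuration = ProxyConfiguration(
        proxy_urls=['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000'],
    )
    crawler = BeautifulSoupCrawler(
        proxy_configuration=proxy_configuration,
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def handle_page(context: BeautifulSoupCrawlingContext) -> None:
        # Store the URL and title of every page fetched through a proxy
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())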

By following these guidelines and leveraging Crawlee's powerful features, you can build efficient, scalable, and maintainable web scrapers for a wide range of applications, from data analysis to search engine development.

Advanced Usage: Crawling Multiple URLs with Crawlee for Python

Setting Up the Crawler for Multiple URLs

To crawl multiple URLs concurrently with Crawlee, you need to initialize the appropriate crawler class and define a request handler function.

Example: BeautifulSoupCrawler Setup

In this example, we will use BeautifulSoupCrawler to crawl multiple URLs:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Register the request handler that defines your scraping logic
    @crawler.router.default_handler
    async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
        # Define your scraping logic here
        pass

    # Pass all starting URLs to a single run
    await crawler.run(['https://example1.com', 'https://example2.com'])

if __name__ == '__main__':
    asyncio.run(main())

This setup allows you to process multiple URLs concurrently, improving the efficiency of your scraping tasks.

Implementing Request Handlers

The request handler defines how each URL should be processed. Here is a more detailed example:

Example: Detailed Request Handler

# 'crawler' is the BeautifulSoupCrawler instance created in the previous example
@crawler.router.default_handler
async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
    # context.soup is the parsed HTML document
    title = context.soup.select_one('title').text
    paragraphs = [p.text for p in context.soup.select('p')]

    print(f"URL: {context.request.url}")
    print(f"Title: {title}")
    print(f"Number of paragraphs: {len(paragraphs)}")

    # Enqueue more links if needed, e.g. only category pages
    await context.enqueue_links(selector='a[href*="/category/"]')

This handler extracts the page title, counts paragraphs, and enqueues additional links for crawling.

Managing the Request Queue

Crawlee uses a RequestQueue to manage URLs. You can add URLs dynamically during the crawling process.

Example: Adding URLs Dynamically

await crawler.add_requests(['https://example3.com', 'https://example4.com'])

This feature allows for dynamic expansion of your crawl scope.
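
If you need direct access to the underlying queue, for example to seed it before the crawler starts, you can open it explicitly. This is a minimal sketch assuming the RequestQueue class from crawlee.storages and the Request.from_url helper available in recent releases:

from crawlee import Request
from crawlee.storages import RequestQueue

async def seed_queue() -> None:
    # Open the default request queue and add a request to it directly
    request_queue = await RequestQueue.open()
    await request_queue.add_request(Request.from_url('https://example5.com'))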

Handling Different Types of Content

Different websites may require different handling approaches. Crawlee provides various crawler classes for these scenarios.

Example: PlaywrightCrawler for JavaScript-rendered Content

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handle_request(context: PlaywrightCrawlingContext) -> None:
        # Wait for the JavaScript-rendered element before reading it
        await context.page.wait_for_selector('.dynamic-content')
        content = await context.page.inner_text('.dynamic-content')
        print(f"Dynamic content: {content}")

    await crawler.run(['https://js-heavy-site.com'])

if __name__ == '__main__':
    asyncio.run(main())

This flexibility allows efficient crawling of both static and dynamic websites.

Optimizing Performance and Handling Errors

Optimizing performance and handling errors is crucial when crawling multiple URLs.

Techniques for Optimization:

  1. Automatic retries: Crawlee automatically retries failed requests (a combined retry and error-handling sketch follows this list).
  2. Concurrency control: Adjust the number of concurrent requests.
    from crawlee import ConcurrencySettings
    crawler = BeautifulSoupCrawler(concurrency_settings=ConcurrencySettings(max_concurrency=5))
  3. Proxy rotation: Use proxy rotation to avoid IP bans.
    from crawlee.proxy_configuration import ProxyConfiguration
    crawler = BeautifulSoupCrawler(proxy_configuration=ProxyConfiguration(proxy_urls=['http://proxy1.com', 'http://proxy2.com']))
  4. Error handling: Implement try-except blocks in your request handler.
    async def handle_request(context: BeautifulSoupCrawlingContext) -> None:
        try:
            ...  # Your scraping logic here
        except Exception as e:
            print(f"Error processing {context.request.url}: {e}")

These optimizations ensure your crawler remains efficient and resilient.

Conclusion

By leveraging these advanced techniques with Crawlee for Python, you can create powerful web scraping solutions capable of handling complex scenarios and large-scale data extraction tasks. Crawlee's unified interface for HTTP and headless browser crawling, combined with robust error handling and performance optimization features, makes it an excellent choice for building reliable and scalable web crawlers.

For more information, refer to the Crawlee Documentation.
