In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPI make it possible to build a powerful JSON API for the data. With Playwright's help, we'll extend our scraper, and we'll avoid blocking by using ScrapingAnt's web scraping API.
To complete the whole tutorial we'll need the following libraries:
- FastAPI - to implement the API
- Uvicorn - to serve the API
- AutoScraper - to parse data into structured format
- Playwright or ScrapingAnt API client - for blocking avoidance and web scraping at scale
To install all the requirements via pip, just run the following command (Python 3.7+ required):
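As a rough sketch of this step from Python, the snippet below checks which of the libraries are importable and assembles a matching pip command for the missing ones. The PyPI package names and module names in the mapping (in particular `scrapingant-client` / `scrapingant_client`) are assumptions, not something stated in the text above.

```python
import importlib.util

# Assumed mapping: PyPI package name -> importable module name.
REQUIREMENTS = {
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "autoscraper": "autoscraper",
    "playwright": "playwright",
    "scrapingant-client": "scrapingant_client",
}


def missing_packages(requirements=REQUIREMENTS):
    """Return the package names whose module cannot be found."""
    return [
        package
        for package, module in requirements.items()
        if importlib.util.find_spec(module) is None
    ]


def pip_command(packages):
    """Build a pip command for the given packages ('' if nothing is missing)."""
    return "pip install " + " ".join(packages) if packages else ""


print(pip_command(missing_packages()) or "All libraries are already installed.")
```

Note that Playwright also needs its browser binaries, which are downloaded separately with `playwright install` after the package itself is installed.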
Not all the described libraries are required for a complete project, but I'd suggest installing them to try different ways of data extraction.
Let's start with the most interesting part: creating a smart scraper to fetch data from Amazon's search result page. For a simple case, we'll use the product link as the required data for extraction. To generate a link to an Amazon product, we need its unique product ID (the ASIN).
Using AutoScraper, this is easily done by just providing some sample data:
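A minimal sketch of this learning step could look as follows. The sample IDs are placeholders that must be replaced with product IDs actually present on the page, and the `url` alias, the helper names and the two URL patterns are assumptions made for illustration. AutoScraper is imported inside the function so the helpers can be used even when it is not installed.

```python
from urllib.parse import urlencode


def amazon_search_url(query: str) -> str:
    """Search result page for a query (assumed URL pattern)."""
    return "https://www.amazon.com/s?" + urlencode({"k": query})


def amazon_product_url(product_id: str) -> str:
    """Product link built from a product ID (assumed URL pattern)."""
    return f"https://www.amazon.com/dp/{product_id}"


def learn_product_ids(url: str, wanted_dict: dict):
    """Teach AutoScraper the rules that extract the sample values."""
    from autoscraper import AutoScraper  # lazy import: optional dependency

    scraper = AutoScraper()
    scraper.build(url=url, wanted_dict=wanted_dict)
    # grouped=True returns a dict: rule id -> values extracted by that rule
    grouped = scraper.get_result_similar(url, grouped=True)
    return scraper, grouped


# Several samples under one alias let the scraper learn several item layouts.
# Placeholders only: replace them with IDs visible on the page you scrape.
wanted_dict = {"url": ["<PRODUCT_ID_1>", "<PRODUCT_ID_2>"]}

# Example call (performs a real HTTP request, so it is left commented out):
# scraper, grouped = learn_product_ids(amazon_search_url("baking"), wanted_dict)
# print(grouped)
```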
The code above shows how to supply several samples when the structure of items on search result pages differs and we want the scraper to learn them all. That is redundant for the current case and is included only to show this cool feature.
If you want to copy and run this code, you may need to update the wanted_dict with your values.
From the output, we’ll know which rule corresponds to which data:
After analyzing the output, let’s keep only the desired rules, remove the rest, and save our model:
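One possible way to do this is sketched below. It assumes the grouped output is a dict mapping rule ids to extracted values, as returned by `get_result_similar(..., grouped=True)`. The `rules_matching` helper is hypothetical: it automates the manual analysis by keeping every rule whose output contains at least one known sample. `keep_rules` and `save` are AutoScraper methods; the file name is up to you.

```python
def rules_matching(grouped: dict, samples: list) -> list:
    """Return ids of the rules whose output contains at least one known sample."""
    wanted = set(samples)
    return [rule_id for rule_id, values in grouped.items() if wanted & set(values)]


def keep_and_save(scraper, grouped: dict, samples: list, path: str) -> list:
    """Keep only the matching rules on the scraper and save the model to disk."""
    rule_ids = rules_matching(grouped, samples)
    scraper.keep_rules(rule_ids)  # drops every rule not listed
    scraper.save(path)            # can be restored later with scraper.load(path)
    return rule_ids
```

Picking rule ids by hand from the printed output works just as well; the helper only saves retyping the generated ids.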
Let's do the same trick for eBay:
... and the execution hangs.
What is the reason? Why didn't it work?
AutoScraper is a good tool for automatic, intelligent parsing, but it is not built for fetching pages: it downloads the URL with a plain HTTP request, so direct extraction by URL does not work for some sites.
Let's try a workaround for the learning phase only: save the required page as HTML via the browser, then open it in Python and pass the data into AutoScraper:
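A sketch of this workaround, assuming the page was saved from the browser as a UTF-8 HTML file. The file name and the sample values in the usage comment are placeholders, and the helper names are made up for illustration.

```python
from pathlib import Path


def read_saved_page(path: str) -> str:
    """Read a page previously saved from the browser."""
    return Path(path).read_text(encoding="utf-8")


def learn_from_html(html: str, wanted_dict: dict):
    """Build an AutoScraper model from HTML content instead of a URL."""
    from autoscraper import AutoScraper  # lazy import: optional dependency

    scraper = AutoScraper()
    scraper.build(html=html, wanted_dict=wanted_dict)
    # Inspect which rule produced which values, as in the Amazon case
    grouped = scraper.get_result_similar(html=html, grouped=True)
    return scraper, grouped


# Example usage (placeholders, adjust to your saved file and samples):
# html = read_saved_page("ebay-search.html")
# scraper, grouped = learn_from_html(html, {"url": ["<ITEM_LINK_FROM_THE_PAGE>"]})
```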
How to extract data and avoid anti-scraping techniques? Let's try out two different approaches:
Playwright is a high-level API for automating headless Chromium, Firefox and WebKit browsers. It means we can automate the action we performed above (opening the page in a browser and taking its HTML), and because a real browser loads the page, this is far less likely to trigger the website's anti-scraping mechanism. Let's create a function that does content extraction:
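A minimal sketch of such a function, using Playwright's synchronous API with Chromium. The function name is an assumption, and the browser binaries must be installed beforehand with `playwright install`.

```python
def get_html_with_playwright(url: str) -> str:
    """Open the URL in headless Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # lazy import: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)         # waits for the page load event by default
            return page.content()  # full HTML of the rendered page
        finally:
            browser.close()
```

The synchronous API cannot be called from inside a running asyncio event loop, so in an async application it has to run in a worker thread, or be replaced with `playwright.async_api`.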
This function lets us get the page's HTML content with a much lower chance of being detected as a bot and blocked.
Also, we'll need rotating proxies to avoid rate limiting when scraping at scale. Check out how to use a proxy in Playwright.
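As an illustration, a simple round-robin rotation can be sketched with `itertools.cycle`. The proxy addresses below are hypothetical and the helper names are made up; the `proxy` argument of `launch` with its `server`, `username` and `password` keys is part of Playwright.

```python
from itertools import cycle

# Hypothetical proxy pool: replace with your own proxy servers.
PROXIES = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
]
_rotation = cycle(PROXIES)


def next_proxy_settings(username=None, password=None) -> dict:
    """Return Playwright proxy settings for the next proxy in the pool."""
    settings = {"server": next(_rotation)}
    if username and password:
        settings.update(username=username, password=password)
    return settings


def get_html_via_proxy(url: str) -> str:
    """Fetch rendered HTML through the next proxy in the rotation."""
    from playwright.sync_api import sync_playwright  # lazy import: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=next_proxy_settings())
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```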
Let's check out the pros and cons of using Playwright for web scraping:
Pros:
- On-premise or cloud solution with no third party API
- Full browser control
- Customisable proxy pool
- Community support
Cons:
- Low scalability
- High proxy prices
- Complicated maintenance
- High expenses for infrastructure to run CPU intense task
The ScrapingAnt web scraping API is a web interface for data extraction at scale that already uses headless browsers and rotating proxies under the hood. Only one action is needed: execute an API request with a target URL. Let's create the same function using the ScrapingAnt API:
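A sketch of the equivalent function, assuming the official `scrapingant-client` package and its `ScrapingAntClient.general_request` call. The function name and the way the token is passed in are choices made for illustration.

```python
def get_html_with_scrapingant(url: str, token: str) -> str:
    """Fetch rendered HTML through the ScrapingAnt API."""
    from scrapingant_client import ScrapingAntClient  # lazy import: optional dependency

    client = ScrapingAntClient(token=token)
    result = client.general_request(url)  # one API request per target URL
    return result.content                 # HTML of the scraped page


# Example usage (requires a valid API token, e.g. read from an environment variable):
# html = get_html_with_scrapingant("https://www.ebay.com/sch/i.html?_nkw=baking", token)
```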
Pros of the ScrapingAnt API:
- Simple usage
- Always reproducible results
- Low maintenance and usage price
- Large proxy pool
- High data extraction success rate
Cons:
- Third-party API usage
- Paid when scraping web pages at scale
Model creation should be implemented in exactly the same way you are going to scrape other pages. If you're planning to use AutoScraper with the url param, you should also create the model by passing the url param. Different browsers may render different layouts, so a model learned from Playwright or desktop Chrome HTML may fail to parse HTML retrieved by passing the URL directly or content retrieved via the ScrapingAnt web scraping API.
To save HTML for learning, you can use the following snippets:
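One way to sketch this, independent of the fetching approach: the helper below takes any function that maps a URL to HTML (for example a Playwright-based or a ScrapingAnt-based one) and stores the result on disk. The helper name and file name are assumptions.

```python
from pathlib import Path
from typing import Callable


def save_html_for_learning(fetch: Callable[[str], str], url: str, path: str) -> str:
    """Fetch a page with the given function and store its HTML for model building."""
    html = fetch(url)
    Path(path).write_text(html, encoding="utf-8")
    return html


# Example usage with a hypothetical fetch function of your choice:
# save_html_for_learning(my_fetch_function, "https://www.ebay.com/sch/i.html?_nkw=baking", "ebay.html")
```

Using the same `fetch` function for learning and for production scraping keeps the model and the scraped HTML consistent, which is the point made above.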
It's time to glue everything together and create our e-commerce API based on Amazon and eBay. The final code contains both data extraction functions, so you'll be able to modify it and use your favorite:
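A condensed sketch of how the pieces could fit together is shown below. It is not the complete final code: the data extraction function is passed in as `fetch_html`, so either approach can be plugged in. The search URL templates, the model file paths and all function names are assumptions.

```python
from urllib.parse import quote_plus

# Assumed search URL templates for both platforms.
SEARCH_URLS = {
    "amazon": "https://www.amazon.com/s?k={query}",
    "ebay": "https://www.ebay.com/sch/i.html?_nkw={query}",
}


def search_platforms(query: str, fetch_html, extractors: dict) -> dict:
    """Fetch each platform's search page and run its extractor on the HTML."""
    results = {}
    for platform, template in SEARCH_URLS.items():
        html = fetch_html(template.format(query=quote_plus(query)))
        results[platform] = extractors[platform](html)
    return results


def load_extractors(model_paths: dict) -> dict:
    """Load saved AutoScraper models and wrap each as html -> results."""
    from autoscraper import AutoScraper  # lazy import: optional dependency

    extractors = {}
    for platform, path in model_paths.items():
        scraper = AutoScraper()
        scraper.load(path)
        # Bind the scraper as a default argument so each lambda keeps its own model
        extractors[platform] = lambda html, s=scraper: s.get_result_similar(
            html=html, group_by_alias=True
        )
    return extractors


def create_app(fetch_html, model_paths: dict):
    """Build the FastAPI application around a chosen fetch function."""
    from fastapi import FastAPI  # lazy import: optional dependency

    app = FastAPI()
    extractors = load_extractors(model_paths)

    @app.get("/")
    def search(q: str):
        # A plain (non-async) endpoint runs in a worker thread,
        # so a blocking fetch function is safe to call here.
        return search_platforms(q, fetch_html, extractors)

    return app
```

With a module-level line such as `app = create_app(my_fetch_function, {"amazon": "amazon-search", "ebay": "ebay-search"})` (hypothetical names), the application can be served like any other FastAPI app.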
Execute the following command to run the API server:
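For reference, the server can also be started from Python with `uvicorn.run`, as sketched below. The `main:app` import string assumes the application object is called `app` and lives in `main.py`, which is an assumption about your file layout; on the command line the equivalent would be `uvicorn main:app --port 8080`.

```python
def serve(app_path: str = "main:app", host: str = "127.0.0.1", port: int = 8080) -> None:
    """Start Uvicorn for the given 'module:attribute' application path."""
    import uvicorn  # lazy import: optional dependency

    uvicorn.run(app_path, host=host, port=port)


# Example usage (blocks until the server is stopped):
# serve()
```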
After running this command, the API server will be up and listening on port 8080. So let’s test our API by opening http://127.0.0.1:8080/?q=baking in our browser:
And finally, we have our own custom multi-source e-commerce API. Just replace baking in the URL with your desired search query to get its results.
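The same request can be made from a script instead of the browser. The sketch below uses only the standard library; it assumes the server is running locally on port 8080 and returns JSON, and the helper names are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def api_url(query: str, base: str = "http://127.0.0.1:8080/") -> str:
    """Build the API URL for a search query, escaping it properly."""
    return base + "?" + urlencode({"q": query})


def search(query: str, base: str = "http://127.0.0.1:8080/") -> dict:
    """Call the running API and decode its JSON response."""
    with urlopen(api_url(query, base), timeout=60) as response:
        return json.load(response)


# Example usage (requires the API server to be running):
# print(search("baking"))
```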
This tutorial is intended for personal and educational use. I hope this article is useful and helps you turn the provided code snippets into a fully functional, production-grade application.
I'd recommend following the links below to extend your knowledge and dive deeper into the parts of the project:
- AutoScraper GitHub
- AutoScraper Examples
- Introducing AutoScraper: A Smart, Fast and Lightweight Web Scraper For Python
- AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes
- FastAPI Documentation
- Playwright Documentation
- ScrapingAnt Documentation
Happy web scraping, and don't forget to check websites' policies regarding scraping bots 😉