Turn Any Website Into An API with AutoScraper and FastAPI
Oleg Kulyk
Co-Founder @ ScrapingAnt

In this article, we will learn how to create a simple e-commerce search API that supports multiple platforms: eBay and Amazon. AutoScraper and FastAPI make it possible to build a powerful JSON API on top of the scraped data. With Playwright's help, we'll extend our scraper, and we'll avoid blocking by using ScrapingAnt's web scraping API.
Let's satisfy the requirements

To complete the whole tutorial, we'll need the following libraries:
- FastAPI - to implement the API
- Uvicorn - to serve the API
- AutoScraper - to parse data into structured format
- Playwright or ScrapingAnt API client - for blocking avoidance and web scraping at scale
To install all the requirements via pip, run the following command (Python 3.7+ required):
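A likely form of that command, assuming the usual PyPI package names (with `scrapingant-client` as the ScrapingAnt API client):

```shell
# Package names are assumed to be the standard PyPI ones.
pip install fastapi uvicorn autoscraper playwright scrapingant-client
# Playwright also needs to download its browser binaries once.
playwright install
```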
tip
Not all the described libraries are required for a complete project, but I'd suggest installing them to try different ways of data extraction.
Scrape one page, and you'll scrape them all

Let's start with the most interesting part: creating a smart scraper that fetches data from Amazon's search results page. For a simple case, we'll extract the title, price and product link as the required data. To generate a link to an Amazon product, we'll need its unique product ID, such as B077XTPWZ5.
With AutoScraper, this is easily done by providing some sample data:
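A minimal sketch of such a learning step could look like this. The search URL, the placeholder strings in `wanted_dict`, the helper names `learn_rules` and `product_link`, and the `/dp/<id>` link format are assumptions made for illustration; only the product ID comes from the text above. AutoScraper is imported inside the function so that the snippet loads even when the library is not installed.

```python
# Search results page to learn from (assumed URL format; change the query as needed).
AMAZON_SEARCH_URL = "https://www.amazon.com/s?k=baking"

# Several samples per field let the scraper learn more than one item layout.
# The strings in angle brackets are placeholders: copy real values from the page.
wanted_dict = {
    "title": ["<title of one product>", "<title of a differently styled product>"],
    "price": ["<price of one product>", "<price of a differently styled product>"],
    "product_id": ["B077XTPWZ5"],
}


def learn_rules(url, wanted):
    """Build an AutoScraper model from samples and return it with its grouped output."""
    from autoscraper import AutoScraper  # optional dependency, imported lazily

    scraper = AutoScraper()
    scraper.build(url=url, wanted_dict=wanted)
    # grouped=True returns a dict of {rule_id: [values matched by that rule]}.
    grouped = scraper.get_result_similar(url=url, grouped=True)
    return scraper, grouped


def product_link(product_id):
    """Turn a product ID into a product page link (assumed /dp/<id> format)."""
    return f"https://www.amazon.com/dp/{product_id}"
```

Calling `learn_rules(AMAZON_SEARCH_URL, wanted_dict)` needs network access and real sample values, so no output is shown here.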
The code above shows how to pass several samples, which helps when items on the search results page have different structures and we want the scraper to learn them all. It is redundant for the current case and is included only to show this cool feature.
note
If you want to copy and run this code, you may need to update the wanted_dict with your values.
From the output, we’ll know which rule corresponds to which piece of data.
After analyzing the output, let’s keep only the desired rules, remove the rest, and save our model:
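Picking the rules can be done by eye, but it can also be sketched in code. The helper names `match_rules_to_fields` and `keep_and_save` are made up for this example, and the grouped output is assumed to be a dict of rule IDs to lists of extracted values; `keep_rules`, `set_rule_aliases` and `save` are AutoScraper methods.

```python
def match_rules_to_fields(grouped, samples):
    """Map each rule ID to the field whose known sample value that rule extracted.

    grouped: {rule_id: [extracted values]} as returned with grouped=True.
    samples: {field_name: one value known to be on the page}.
    """
    aliases = {}
    for rule_id, values in grouped.items():
        for field, sample in samples.items():
            if sample in values:
                aliases[rule_id] = field
                break  # one field per rule is enough
    return aliases


def keep_and_save(scraper, aliases, path):
    """Keep only the chosen rules, give them readable names, and save the model."""
    scraper.keep_rules(list(aliases))   # every rule not listed is dropped
    scraper.set_rule_aliases(aliases)   # {rule_id: field_name}
    scraper.save(path)
```

Rules that extracted none of the sample values (ads, navigation text and so on) are simply left out of the mapping and therefore removed from the model.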
Let's do the same trick for eBay:
... and the execution hangs.
What is the reason? Why didn't it work?
AutoScraper is a good tool for automatic, intelligent parsing. Still, it has no advanced capabilities for fetching the page itself (such as rendering it in a browser or getting around anti-bot protection), so passing a URL directly does not work for some sites.
Let's try a workaround for the learning phase only: save the required page as HTML via the browser, then open it in Python and pass the content into AutoScraper:
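A sketch of that workaround, assuming the page was saved from the browser as a UTF-8 HTML file. The function names and the file name in the usage comment are hypothetical; `build` and `get_result_similar` accept the page source through their `html` argument.

```python
from pathlib import Path


def read_saved_page(path):
    """Read an HTML file that was saved from the browser."""
    return Path(path).read_text(encoding="utf-8")


def learn_from_html(html, wanted):
    """Build a model from raw HTML instead of letting AutoScraper fetch a URL."""
    from autoscraper import AutoScraper  # optional dependency, imported lazily

    scraper = AutoScraper()
    scraper.build(html=html, wanted_dict=wanted)
    return scraper, scraper.get_result_similar(html=html, grouped=True)


# Usage (file name and sample values are placeholders):
# html = read_saved_page("ebay_search.html")
# scraper, grouped = learn_from_html(html, {"title": ["<a title from the page>"]})
```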
How can we extract data and avoid anti-scraping techniques? Let's try out two different approaches:
Playwright

Playwright is a high-level API for controlling headless Chromium, Firefox and WebKit browsers. It means that we can automate the action we performed above (opening the page in a browser and grabbing its HTML), and because a real browser is used, this is far less likely to trigger the website's anti-scraping mechanism. Let's create a function that does content extraction:
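One way to write such a function with Playwright's async API is sketched below. The names `get_page_content` and `get_page_content_sync` are chosen for this example, and Chromium is an arbitrary pick among the supported browsers. Playwright is imported inside the function so that the snippet loads even if this optional dependency is missing.

```python
import asyncio


async def get_page_content(url: str) -> str:
    """Open `url` in headless Chromium and return the rendered HTML."""
    from playwright.async_api import async_playwright  # optional dependency

    async with async_playwright() as p:
        browser = await p.chromium.launch()  # headless by default
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()


def get_page_content_sync(url: str) -> str:
    """Blocking wrapper, handy in one-off scripts such as the learning phase."""
    return asyncio.run(get_page_content(url))
```

The async version fits naturally into an async web framework, while the blocking wrapper is only meant for scripts that have no event loop running.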
This function allows us to get the page's HTML content with a much lower chance of being detected as a bot and blocked.
Also, we'll need rotating proxies to avoid rate limiting when scraping at scale. Check out how to use a proxy in Playwright.
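Here is a sketch of how a proxy could be plugged in, assuming a small pool of proxy servers that we pick from at random for each request. The proxy addresses and helper names are made up; the `proxy` launch option with `server`, `username` and `password` keys is part of Playwright's API.

```python
import random


def proxy_settings(server, username=None, password=None):
    """Build the dict that Playwright's launch(proxy=...) option expects."""
    settings = {"server": server}
    if username is not None:
        settings["username"] = username
    if password is not None:
        settings["password"] = password
    return settings


def pick_proxy(pool):
    """Naive rotation: choose a random proxy for every new browser launch."""
    return random.choice(pool)


async def get_page_content_via_proxy(url, proxy):
    """Same as a plain fetch, but all browser traffic goes through `proxy`."""
    from playwright.async_api import async_playwright  # optional dependency

    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy=proxy)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()


# Hypothetical pool; replace with real proxy servers and credentials.
PROXY_POOL = [
    proxy_settings("http://proxy-1.example.com:3128"),
    proxy_settings("http://proxy-2.example.com:3128", "user", "secret"),
]
```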
Let's check out the pros and cons of using Playwright for web scraping:
Pros:
- On-premise or cloud solution with no third party API
- Full browser control
- Customisable proxy pool
- Community support
Cons:
- Low scalability
- High proxy prices
- Complicated maintenance
- High infrastructure expenses for running CPU-intensive tasks
ScrapingAnt API

ScrapingAnt web scraping API is a web interface for performing data extraction at scale; it already uses headless browsers and rotating proxies under the hood. Only one action is needed: execute an API request with a target URL. Let's create the same function using the ScrapingAnt API:
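A sketch of that function built on the official Python client (`scrapingant-client`). The helper names and the `SCRAPINGANT_TOKEN` environment variable are assumptions for this example; `ScrapingAntClient`, `general_request` and the `content` attribute of its result come from the client library.

```python
import os


def make_client(token=None):
    """Create a ScrapingAnt client; the token falls back to an environment variable."""
    from scrapingant_client import ScrapingAntClient  # optional dependency

    # SCRAPINGANT_TOKEN is a hypothetical variable name chosen for this example.
    return ScrapingAntClient(token=token or os.environ["SCRAPINGANT_TOKEN"])


def get_page_content(url, client):
    """Return the HTML of `url` as fetched by the ScrapingAnt API."""
    return client.general_request(url).content


# Usage (requires a real API token):
# client = make_client("<YOUR-SCRAPINGANT-API-TOKEN>")
# html = get_page_content("https://www.ebay.com/sch/i.html?_nkw=baking", client)
```

Passing the client in as an argument keeps the token handling in one place and lets the same client be reused across many requests.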
note
To get your API token, please visit the Login page and sign in to the ScrapingAnt user panel. It's free.
Pros:
- Simple usage
- Always reproducible results
- Low maintenance and usage price
- Large proxy pool
- High data extraction success rate
Cons:
- Third-party API usage
- Paid when scraping web pages at scale
Learning phase notes

warning
The model should be created from HTML obtained in exactly the same way as you're going to obtain it when scraping other pages. If you're planning to use AutoScraper with the url param, you should also create the model by passing the url param. Different browsers and retrieval methods may produce different layouts, so a model built from Playwright or desktop Chrome HTML may fail to parse HTML retrieved by passing a URL directly or content retrieved via the ScrapingAnt web scraping API.
To save HTML for learning, you can use the following snippets:
Playwright
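For a one-off learning script the synchronous Playwright API is enough. The sketch below fetches a page and writes it to disk; the function names, the search URL and the output file name are placeholders.

```python
from pathlib import Path


def fetch_html_with_playwright(url):
    """Render `url` in headless Chromium and return its HTML (sync API)."""
    from playwright.sync_api import sync_playwright  # optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def save_html(html, path):
    """Write the HTML to `path` so it can be reused for model building."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Usage (URL and file name are placeholders):
# save_html(fetch_html_with_playwright("https://www.ebay.com/sch/i.html?_nkw=baking"),
#           "ebay_search.html")
```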
ScrapingAnt
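The same idea with the ScrapingAnt client, sketched under the assumption that a `ScrapingAntClient` instance is passed in. The function name, token placeholder, URL and file name are illustrative.

```python
from pathlib import Path


def save_page_with_scrapingant(client, url, path):
    """Fetch `url` through a ScrapingAnt client and write the HTML to `path`."""
    html = client.general_request(url).content
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Usage (token, URL and file name are placeholders):
# from scrapingant_client import ScrapingAntClient
# client = ScrapingAntClient(token="<YOUR-SCRAPINGANT-API-TOKEN>")
# save_page_with_scrapingant(client, "https://www.ebay.com/sch/i.html?_nkw=baking",
#                            "ebay_search.html")
```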
Create your own API

It's time to glue everything together and create our e-commerce API based on Amazon and eBay. The final code contains both data extraction functions, so you'll be able to modify it and use your favorite:
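A condensed sketch of how the pieces could fit together is shown below. It is not the full final code: the search URL templates, the helper names, the shape of the JSON response and the assumption that each saved model uses aliases such as `title` and `price` are all illustrative. `fetch_html` stands for any async function that returns a page's HTML, and `models` maps a platform name to a loaded AutoScraper model.

```python
from urllib.parse import quote_plus

# Assumed search URL formats for both platforms.
SEARCH_URLS = {
    "amazon": "https://www.amazon.com/s?k={query}",
    "ebay": "https://www.ebay.com/sch/i.html?_nkw={query}",
}


def build_search_urls(query):
    """Return {platform: search URL} for a user query."""
    encoded = quote_plus(query)
    return {name: template.format(query=encoded) for name, template in SEARCH_URLS.items()}


def rows_from_grouped(platform, grouped):
    """Turn {'title': [...], 'price': [...]} columns into one dict per item.

    Assumes the columns are aligned; extra values in longer columns are dropped.
    """
    fields = list(grouped)
    columns = [grouped[field] for field in fields]
    return [{**dict(zip(fields, values)), "source": platform} for values in zip(*columns)]


def create_app(fetch_html, models):
    """Build the FastAPI app around a fetch function and the saved models."""
    from fastapi import FastAPI  # imported lazily so the helpers work without it

    app = FastAPI()

    @app.get("/")
    async def search(q: str):
        results = []
        for platform, url in build_search_urls(q).items():
            html = await fetch_html(url)
            grouped = models[platform].get_result_similar(html=html, group_by_alias=True)
            results.extend(rows_from_grouped(platform, grouped))
        return {"query": q, "results": results}

    return app
```

In a real module we would load the saved models with AutoScraper's `load` method, pass in one of the fetch functions, and assign the result of `create_app(...)` to a module-level variable so that the server can find it.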
Execute the following command to run the API server:
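Assuming the application code is saved as main.py (a hypothetical file name) and exposes the FastAPI instance as `app`, the server can be started with Uvicorn like this:

```shell
# main:app means "the object named app inside main.py"; both names are assumptions.
uvicorn main:app --port 8080
```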
After running this command, the API server will be listening on port 8080. Let’s test our API by opening http://127.0.0.1:8080/?q=baking in our browser.
And finally, we have our own custom multi-source e-commerce API. Just replace baking in the URL with your desired search query to get its results.
Conclusion

This tutorial is intended for personal and educational use. I hope this article is useful and helps you turn the provided code snippets into a fully functional, production-grade application.
I'd recommend following the links below to extend your knowledge and have a deep dive into the project parts:
- AutoScraper GitHub
- AutoScraper Examples
- Introducing AutoScraper: A Smart, Fast and Lightweight Web Scraper For Python
- AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes
- FastAPI Documentation
- Playwright Documentation
- ScrapingAnt Documentation
Happy web scraping, and don't forget to check websites' policies regarding scraping bots 😉