In this article, I’d like to share my experience with scraping Amazon products. The Amazon marketplace lists thousands of product types from thousands of sellers, so the amount of data available to scrape is huge. It can be used for:
- Market price comparison
- Price change tracking
- Analyzing product reviews
- Copyright check
- Finding the best products for selling or dropshipping
- A lot of data science and machine learning stuff
So Amazon is one of the biggest sources of product data. All the code is provided in small, understandable chunks, but feel free to contact us if you have any questions. Let’s dive deeper!
For Amazon scraping I have selected the following stack:
- Node.js as a platform for running JS code
- Cheerio library for DOM manipulation and data retrieval
- Got for HTTP requests (unfortunately, the `request` package has been deprecated, so we'll do everything with Got)
That’s all you need to get started. Cheerio lets you work with the DOM in a jQuery-like style, so you can test all your selectors in the browser and then copy them into your code. For more info, refer to the official page: https://cheerio.js.org/
Let’s scrape all the products for a particular keyword, for example, baking mat.
The search URL will look like this: https://amazon.com/s?k=baking+mat
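As a small sketch of how such a URL could be assembled for any keyword, the snippet below uses the `URL` class built into Node.js. The helper name `buildSearchUrl` is my own and not part of any library.

```javascript
// Hypothetical helper: builds an Amazon search URL for a keyword.
function buildSearchUrl(keyword) {
  const url = new URL('https://amazon.com/s');
  // searchParams takes care of encoding; spaces become "+"
  url.searchParams.set('k', keyword);
  return url.toString();
}

console.log(buildSearchUrl('baking mat'));
// prints "https://amazon.com/s?k=baking+mat"
```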
So we can try to get the data from this page with the following code:
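A minimal sketch of such a request, assuming got 11.x (the last release line that supports `require`) is installed with `npm install got@11`. The function name `fetchSearchPage` and the injectable `httpGet` parameter are illustrative choices; the parameter is only there so the function can be tried with a stub instead of a live request.

```javascript
const SEARCH_PAGE_URL = 'https://amazon.com/s?k=baking+mat';

// Downloads a page and resolves with its raw HTML.
// `httpGet` defaults to got, which is loaded lazily on the first real call.
async function fetchSearchPage(url, httpGet = require('got')) {
  const response = await httpGet(url);
  return response.body; // HTML as a string
}

// Usage (performs a real HTTP request):
// fetchSearchPage(SEARCH_PAGE_URL)
//   .then((html) => console.log(html))
//   .catch((error) => console.error(error.message));
```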
But you will not see any products in your console. Amazon recognized our request as coming from a bot and blocked it with a Captcha. How do we deal with that? We need to pretend to be a real browser (one that stores cookies, as Amazon suggests). This problem is easily solved with the ScrapingAnt web scraping API, which runs a real Chrome browser under the hood that you can use from your code.
Let’s modify our code a bit to use the ScrapingAnt API. Before that, we need to get an API token. Just visit https://app.scrapingant.com/login and register for free to start web scraping with headless Chrome.
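Once you have a token, the request could look roughly like the sketch below. It assumes the v2 `general` endpoint of the ScrapingAnt API, with the target page passed in a `url` query parameter, the token sent in an `x-api-key` header, and the rendered HTML returned as the response body. Please check the current ScrapingAnt documentation before relying on these details. `<YOUR_API_TOKEN>` is a placeholder, the helper names are my own, and got 11.x is assumed to be installed.

```javascript
// Assumption: endpoint path and parameter names as described above.
const SCRAPINGANT_ENDPOINT = 'https://api.scrapingant.com/v2/general';

// Builds the request URL and options for rendering `targetUrl` via the API.
function buildApiRequest(targetUrl, apiKey) {
  const endpoint = new URL(SCRAPINGANT_ENDPOINT);
  endpoint.searchParams.set('url', targetUrl); // target URL is percent-encoded
  return {
    url: endpoint.toString(),
    options: { headers: { 'x-api-key': apiKey } },
  };
}

// Resolves with the HTML rendered by the remote browser.
// `httpGet` defaults to got, loaded lazily on the first real call.
async function fetchRenderedPage(targetUrl, apiKey, httpGet = require('got')) {
  const { url, options } = buildApiRequest(targetUrl, apiKey);
  const response = await httpGet(url, options);
  return response.body;
}

// Usage (performs a real HTTP request):
// fetchRenderedPage('https://amazon.com/s?k=baking+mat', '<YOUR_API_TOKEN>')
//   .then((html) => console.log(html))
//   .catch((error) => console.error(error.message));
```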
Can you see how the console content changed from the previous attempt? Now you can see the rendered HTML. But how do we get, for example, product URLs? Let Cheerio help us.
After checking the response, we can conclude that all the data we need from the search results is placed inside `div` elements with a `data-index` attribute. Each of them also has a `data-asin` attribute, which is the unique Amazon product ID. Let’s grab them all to collect the links:
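A minimal sketch of this step, assuming cheerio is installed (`npm install cheerio`) and that a product page can be reached at `https://amazon.com/dp/<ASIN>` (an assumption about Amazon's URL scheme). The function names are hypothetical.

```javascript
// Assumption: product pages are addressable by ASIN under /dp/.
function productUrlFromAsin(asin) {
  return `https://amazon.com/dp/${asin}`;
}

// Collects product links from the HTML of a search results page.
function extractProductUrls(html) {
  const cheerio = require('cheerio'); // loaded lazily, only when needed
  const $ = cheerio.load(html);
  return $('div[data-index][data-asin]')
    .map((_, element) => $(element).attr('data-asin'))
    .get() // convert the Cheerio collection into a plain array
    .filter((asin) => asin !== '') // skip rows that carry an empty ASIN
    .map(productUrlFromAsin);
}

// Usage, with `html` being the rendered search page:
// console.log(extractProductUrls(html));
```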
That’s all. Your console will be filled with the collected product links.
There is a lot of possible data to scrape from the page: thumbnail, price, title, rating.
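Values such as price and rating come back as display text, so they usually need a little cleanup before they can be used for price tracking or comparison. A minimal sketch, assuming prices look like `$1,299.99` and ratings like `4.5 out of 5 stars` (both formats are assumptions about the rendered page, and the helper names are hypothetical):

```javascript
// Extracts a numeric price from display text; returns null if none is found.
function parsePrice(text) {
  const match = /\d[\d,]*(?:\.\d+)?/.exec(text);
  return match ? Number(match[0].replace(/,/g, '')) : null;
}

// Extracts the rating value from text like "4.5 out of 5 stars".
function parseRating(text) {
  const match = /(\d+(?:\.\d+)?) out of 5/.exec(text);
  return match ? Number(match[1]) : null;
}

console.log(parsePrice('$1,299.99')); // prints 1299.99
console.log(parseRating('4.5 out of 5 stars')); // prints 4.5
```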
Pagination, crawling through result pages, and so on can be added as well, but all the data described above can already be retrieved with our open-source Amazon parser:
- GitHub page: https://github.com/ScrapingAnt/amazon_scraper
- NPM package page: https://www.npmjs.com/package/@scrapingant/amazon-proxy-scraper
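If you prefer to add pagination yourself, a simple approach is to generate one search URL per result page and process each of them in turn. A minimal sketch, assuming the search page accepts a `page` query parameter (an assumption about Amazon's URL scheme that may change); `buildPageUrls` is a hypothetical helper:

```javascript
// Builds search URLs for result pages 1..pageCount of a keyword.
function buildPageUrls(keyword, pageCount) {
  const urls = [];
  for (let page = 1; page <= pageCount; page += 1) {
    const url = new URL('https://amazon.com/s');
    url.searchParams.set('k', keyword);
    url.searchParams.set('page', String(page));
    urls.push(url.toString());
  }
  return urls;
}

console.log(buildPageUrls('baking mat', 2));
// prints [ 'https://amazon.com/s?k=baking+mat&page=1',
//          'https://amazon.com/s?k=baking+mat&page=2' ]
```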