In this article, I’d like to share my experience with scraping Amazon products. The well-known Amazon marketplace lists thousands of product types from thousands of sellers, so the amount of data available to scrape is huge. It can be used for:
- Market price comparison
- Price change tracking
- Analyzing product reviews
- Copyright check
- Finding the best products for selling or dropshipping
- A lot of data science and machine learning stuff
So Amazon is one of the biggest sources of product data. All the code is provided in small, understandable chunks, but feel free to contact us if you have any questions. Let’s dive deeper!
Technology stack for product scraping
For Amazon scraping I have selected the following stack:
- Node.js as a platform for running JS code
- Cheerio library for DOM manipulation and data retrieval
- Got for HTTP requests (unfortunately, the request package has been deprecated, so we'll do everything with got)
That’s all you need to start. Cheerio lets you work with the DOM in a jQuery-like style, so you can test all your selectors in the browser and then copy-paste them into your code. For more info, refer to the official page: https://cheerio.js.org/
Exact Web Scraping
Let’s scrape all the products matching a particular keyword, for example, baking mat.
The search URL looks like this: https://amazon.com/s?k=baking+mat
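As a minimal sketch, a search URL like the one above could be built from any keyword with the built-in URL class of Node.js. The function name buildSearchUrl is hypothetical and used only for illustration; the only assumption about Amazon is the k query parameter visible in the URL above.

```javascript
// Hypothetical helper (not part of any library): builds an Amazon search URL
// from a keyword, using the `k` query parameter shown in the article.
function buildSearchUrl(keyword) {
  const url = new URL('https://amazon.com/s');
  // searchParams takes care of encoding; spaces are encoded as "+"
  url.searchParams.set('k', keyword);
  return url.toString();
}

console.log(buildSearchUrl('baking mat'));
// prints: https://amazon.com/s?k=baking+mat
```

Encoding the keyword this way keeps the URL valid even when the keyword contains characters such as & or #.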
So we can try to get the data from this page with the following code:
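// require() works with got v11 and earlier; got v12+ is ESM-only
const got = require('got');
const keyword = "baking mat";
// URLSearchParams encodes the keyword, producing k=baking+mat
const searchUrl = `https://amazon.com/s?${new URLSearchParams({ k: keyword })}`;
got(searchUrl).then(response => console.log(response.body));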
But you will not see any products in your console. Amazon recognized our request as coming from a bot and blocked it with a CAPTCHA. How do we deal with it? We need to pretend to be a real browser (one that stores cookies, as Amazon suggests). This problem is easily solved with the ScrapingAnt web scraping API, which runs a real Chrome browser under the hood that you can use from your code.
Let’s modify our code a bit to use the ScrapingAnt API. Before that, we need to get an API token. Just visit https://app.scrapingant.com/login and register for free to start web scraping with headless Chrome.
const got = require('got');

const apiToken = "<YOUR-SCRAPING-ANT-API-TOKEN>";
const keyword = "baking mat";

// Ask ScrapingAnt to open the Amazon search page in a real browser
got.post('https://api.scrapingant.com/v1/general', {
    headers: {
        "x-api-key": apiToken,
        "content-type": "application/json",
        "accept": "application/json"
    },
    json: {
        // URLSearchParams encodes the keyword (k=baking+mat)
        url: `https://amazon.com/s?${new URLSearchParams({ k: keyword })}`,
    }
}).then(response => console.log(response.body));
Can you see how the console output changed from the previous attempt? Now you can see the rendered HTML. But how do we get, for example, product URLs? Let Cheerio help us.
After checking the response, we can conclude that all the data we need from the search results is placed inside div elements with the data-index attribute. These elements also have a data-asin attribute, which is the unique Amazon product ID. Let’s grab them all to collect the links:
const cheerio = require('cheerio');
const got = require('got');

const apiToken = "<YOUR-SCRAPING-ANT-API-TOKEN>";
const keyword = "baking mat";

got.post('https://api.scrapingant.com/v1/general', {
    headers: {
        "x-api-key": apiToken,
        "content-type": "application/json",
        "accept": "application/json"
    },
    json: {
        // URLSearchParams encodes the keyword (k=baking+mat)
        url: `https://amazon.com/s?${new URLSearchParams({ k: keyword })}`,
    }
}).then(response => {
    const dom = cheerio.load(response.body);
    // Every search result sits in a div with a data-index attribute
    const productList = dom('div[data-index]');
    for (let i = 0; i < productList.length; i++) {
        const asin = productList[i].attribs['data-asin'];
        // Skip blocks without an ASIN
        if (!asin) {
            continue;
        }
        console.log(`https://amazon.com/dp/${asin}`);
    }
});
That’s all. Your console will be filled with output similar to this:
https://amazon.com/dp/B078X6QYNL
https://amazon.com/dp/B06XBX9ND2
https://amazon.com/dp/B07MK2P53L
https://amazon.com/dp/B01ACUA8HC
https://amazon.com/dp/B07MK2P53L
https://amazon.com/dp/B07MZ5LTWQ
https://amazon.com/dp/B07MRM3Q4F
https://amazon.com/dp/B07JHXKHKB
https://amazon.com/dp/B07YS8VZ38
https://amazon.com/dp/B01HOJ2V06
https://amazon.com/dp/B07WWSXVDH
...
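Note that the same product can show up more than once in such output (B07MK2P53L appears twice in the sample above). Here is a minimal sketch of removing duplicates before printing, assuming the ASINs have already been collected into an array. The helper name uniqueProductUrls is hypothetical:

```javascript
// Hypothetical helper: turns a list of ASINs into unique product URLs,
// keeping the order of first appearance.
function uniqueProductUrls(asins) {
  // filter(Boolean) drops empty values; a Set preserves insertion order
  // and ignores repeated values
  const uniqueAsins = new Set(asins.filter(Boolean));
  return [...uniqueAsins].map(asin => `https://amazon.com/dp/${asin}`);
}

const sample = ['B078X6QYNL', 'B07MK2P53L', '', 'B07MK2P53L'];
console.log(uniqueProductUrls(sample));
```

Whether duplicates should be dropped depends on the use case; for collecting links to crawl later, visiting each product once is usually enough.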
What’s next?
There is a lot more data to scrape from the page: thumbnail, price, title, rating.
Pagination, crawling through pages, and so on can be added as well, but all the data described above can be retrieved with our open-source Amazon parser:
- GitHub page: https://github.com/ScrapingAnt/amazon_scraper
- NPM package page: https://www.npmjs.com/package/@scrapingant/amazon-proxy-scraper
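For the pagination mentioned above, here is a minimal sketch of generating the URLs of several result pages. It assumes that Amazon search accepts a page query parameter, which is not confirmed in this article, so check it against the real site before relying on it. The function name buildSearchPageUrls is hypothetical:

```javascript
// Hypothetical helper: builds search URLs for pages 1..pageCount.
// ASSUMPTION: the search endpoint accepts a `page` query parameter.
function buildSearchPageUrls(keyword, pageCount) {
  const urls = [];
  for (let page = 1; page <= pageCount; page++) {
    const url = new URL('https://amazon.com/s');
    url.searchParams.set('k', keyword);          // search keyword, as in the article
    url.searchParams.set('page', String(page));  // assumed pagination parameter
    urls.push(url.toString());
  }
  return urls;
}

console.log(buildSearchPageUrls('baking mat', 3));
```

Each generated URL can then be sent through the same ScrapingAnt request shown earlier, one page at a time.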