Amazon Product Scraping. Relatively easy.

In this article I’d like to share my experience with scraping Amazon products.
The well-known Amazon marketplace offers deals on thousands of product types from a comparable number of sellers. The amount of data available to scrape is enormous and can be used for:

  • Market price comparison
  • Price change tracking
  • Analyzing product reviews
  • Copyright check
  • Finding the best products for selling or dropshipping
  • And a lot of data science and machine learning stuff

So Amazon looks like one of the biggest sources of product data. All the code will be provided in understandable chunks, but feel free to contact us if you have any questions. Let’s take a deeper dive!

Technology stack for product scraping

For Amazon scraping I have selected the following stack:

  • Node.js as a platform for running JS code
  • The Cheerio library for DOM manipulation and data retrieval
  • Got for HTTP requests (unfortunately, the request package has been deprecated)

And that’s pretty much all you need.
The Cheerio library lets you work with the DOM in a jQuery-like style, so you can test all your selectors in the browser and then copy-paste them into your code. For more info, refer to the official page: https://cheerio.js.org/

The actual web scraping

Let’s scrape all the products for a particular keyword, for example, baking mat.

The search URL will have the following look: https://amazon.com/s?k=baking+mat

So we can try to get the data from this page with the following code:


const got = require('got');

const keyword = "baking mat";

got(`https://amazon.com/s?k=${encodeURIComponent(keyword)}`).then(response => console.log(response.body));

But you will not see any products in your console. Amazon recognized our request as a bot request and blocked us with a captcha. How do we deal with it? We need to pretend to be a real browser (and store cookies, as Amazon expects). That problem is easily resolved with the ScrapingAnt API, which runs a real Chrome browser under the hood that you can use from your code.
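Before switching tools, it can be useful to detect the block programmatically instead of parsing an empty page. Here is a minimal sketch; the marker strings are assumptions based on typical Amazon block pages, not a guaranteed list:

```javascript
// A minimal sketch: detect a likely captcha/robot-check response before parsing.
// The marker strings below are assumptions and may need updating over time.
function looksBlocked(body) {
    const markers = ['captcha', 'robot check', 'to discuss automated access'];
    const lowered = body.toLowerCase();
    return markers.some(marker => lowered.includes(marker));
}

// Usage idea: if (looksBlocked(response.body)) { retry via a browser-based API }
```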

Let’s modify our code a bit to use the ScrapingAnt API, but before that, we need to get an API key. Just visit https://rapidapi.com/okami4kak/api/scrapingant/pricing and select the free Basic plan.


const got = require('got');

const rapidApiKey = "pass your RapidAPI key here";
const keyword = "baking mat";

got.post('https://scrapingant.p.rapidapi.com/post', {
        headers: {
            "x-rapidapi-host": "scrapingant.p.rapidapi.com",
            "x-rapidapi-key": rapidApiKey,
            "content-type": "application/json",
            "accept": "application/json"
        },
        json: {
            url: `https://amazon.com/s?k=${encodeURIComponent(keyword)}`,
        }
    }).then(response => console.log(response.body));


Can you see how the console output changed from the previous attempt? Now you can see the rendered HTML. But how do we get, for example, the product URLs? Let Cheerio help us.

After checking the response body we can conclude that all the needed data in the search result is placed inside div elements with a data-index attribute; each also has a data-asin attribute, which is the unique Amazon product ID (ASIN). Let’s grab them all to collect the links:


const got = require('got');
const cheerio = require('cheerio');


const rapidApiKey = "pass your RapidAPI key here";
const keyword = "baking mat";

got.post('https://scrapingant.p.rapidapi.com/post', {
    headers: {
        "x-rapidapi-host": "scrapingant.p.rapidapi.com",
        "x-rapidapi-key": rapidApiKey,
        "content-type": "application/json",
        "accept": "application/json"
    },
    json: {
        url: `https://amazon.com/s?k=${encodeURIComponent(keyword)}`,
    }
}).then(response => {
    const dom = cheerio.load(response.body);

    const productList = dom(`div[data-index]`);

    for (let i = 0; i < productList.length; i++) {
        if (!productList[i].attribs['data-asin']) {
            continue;
        }

        console.log(`https://amazon.com/dp/${productList[i].attribs['data-asin']}`);
    }
})

That’s all. Your console will be filled with results similar to:


https://amazon.com/dp/B078X6QYNL
https://amazon.com/dp/B06XBX9ND2
https://amazon.com/dp/B07MK2P53L
https://amazon.com/dp/B01ACUA8HC
https://amazon.com/dp/B07MK2P53L
https://amazon.com/dp/B07MZ5LTWQ
https://amazon.com/dp/B07MRM3Q4F
https://amazon.com/dp/B07JHXKHKB
https://amazon.com/dp/B07YS8VZ38
https://amazon.com/dp/B01HOJ2V06
https://amazon.com/dp/B07WWSXVDH
…


What’s next?

There is a lot more data you can scrape from the page: thumbnail, price, title, rating.
Pagination can also be added to crawl through the result pages, etc. All of the described data can be retrieved with our open-source Amazon parser:

GitHub page: https://github.com/ScrapingAnt/amazon_scraper
NPM package page: https://www.npmjs.com/package/@scrapingant/amazon-proxy-scraper
