Scraping with millions of browsers or Puppeteer Cluster
Oleg Kulyk
Co-Founder @ ScrapingAnt

In this article, we'd like to introduce an awesome open-source web scraping solution for running a pool of Chromium instances using Puppeteer.
## Running a pool of Chromium instances using Puppeteer

Executing web scraping tasks sequentially is not a good idea: each process has to wait for the previous ones to complete, which quickly becomes time-consuming once many processes are waiting in the queue. To overcome this, we will run these tasks in parallel, so that they execute concurrently and the total run time drops significantly.
What does this library do?
- Handles crawling errors
- Automatically restarts the browser in case of a crash
- Can automatically retry a job if it fails
- Offers different concurrency models to choose from (pages, contexts, browsers), as shown in the sketch after this list
- Simple to use, with little boilerplate
- Provides a progress view and monitoring statistics
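As a rough sketch, most of these features can be configured through the options passed to `Cluster.launch`. The option names below come from the puppeteer-cluster documentation; the values are illustrative placeholders rather than recommendations:

```javascript
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // or CONCURRENCY_PAGE / CONCURRENCY_BROWSER
    maxConcurrency: 4, // number of parallel workers (placeholder value)
    retryLimit: 2,     // re-queue a failed job up to two times
    monitor: true,     // print progress and statistics to the terminal
  });

  // ... define a task and queue jobs here (see the usage section below) ...

  await cluster.close();
})();
```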
To read more about Puppeteer itself, just visit the official GitHub page: https://github.com/puppeteer/puppeteer
## Installing Puppeteer Cluster

To start using Puppeteer Cluster, first install the dependencies, for example via NPM.

Install `puppeteer` (if you don't already have it installed):
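```bash
npm install puppeteer
```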
Then install `puppeteer-cluster`:
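```bash
npm install puppeteer-cluster
```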
## Usage

All you need to provide when using Puppeteer Cluster is:

- The number of processes to run in parallel
- A task definition that performs the expected action
- The jobs to queue; then wait until the cluster completes the execution
Putting these pieces together, the whole script will look roughly like the example below. The target URLs and the concurrency limit are placeholders to adapt to your own use case:
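```javascript
const { Cluster } = require('puppeteer-cluster');

(async () => {
  // Launch a pool of workers sharing one browser via incognito contexts;
  // maxConcurrency is an illustrative value
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
  });

  // Define the task that every queued job will run
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const title = await page.title();
    console.log(`${url}: ${title}`);
  });

  // Queue the jobs to be processed in parallel (placeholder URLs)
  cluster.queue('https://example.com/');
  cluster.queue('https://example.org/');
  cluster.queue('https://example.net/');

  // Wait until all queued tasks are done, then shut the cluster down
  await cluster.idle();
  await cluster.close();
})();
```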
And that is all.
## Documentation

For the extensive documentation, just visit the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster
And the examples directory inside the repository: https://github.com/thomasdondorf/puppeteer-cluster/tree/master/examples
## Conclusion

Apart from scraping, you can also make use of Puppeteer Cluster for automation testing, performance testing, improving your site's ranking, and many more cases that benefit from parallel browser workers.
Of course, you can also try our Web Scraping API, which supports parallel execution of headless Chrome rendering behind a simple interface.