Open source scraping with a million browsers, or Puppeteer Cluster

In this article, we’d like to introduce an awesome open source web scraping solution for running a pool of Chromium instances using Puppeteer.

Running web scraping tasks sequentially is not a good idea: each job has to wait for the previous one to finish, which quickly becomes time-consuming when many jobs are waiting in a queue. To overcome this, we will run the jobs in parallel, letting them execute concurrently and finish in far less time.
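
For comparison, here is a minimal sketch of the sequential baseline with plain Puppeteer (the URLs are just placeholders): every page load blocks the next one.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Each iteration waits for the previous one to finish
  for (const url of ['http://www.google.com/', 'https://scrapingant.com/']) {
    await page.goto(url);
    // Perform the action, for example store the result
  }

  await browser.close();
})();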

What does this library do?

  • Handling of crawling errors
  • Auto restarts the browser in case of a crash
  • Can automatically retry if a job fails
  • Different concurrency models to choose from (pages, contexts, browsers)
  • Simple to use, small boilerplate
  • Progress view and monitoring statistics (see below)
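
Most of these features map directly to options you pass to Cluster.launch. The snippet below is a rough sketch based on the options documented in the puppeteer-cluster README; the concrete values are only placeholders.

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // or CONCURRENCY_PAGE / CONCURRENCY_BROWSER
    maxConcurrency: 4,                        // number of parallel workers
    monitor: true,                            // print progress and statistics to the console
    retryLimit: 2,                            // retry a failed job up to two times
    retryDelay: 1000,                         // wait one second between retries
    timeout: 30000,                           // abort a job after 30 seconds
    puppeteerOptions: { headless: true },     // passed straight to puppeteer.launch
  });

  // ... define a task and queue jobs here, then:
  await cluster.idle();
  await cluster.close();
})();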

To read more about Puppeteer itself, visit the official GitHub page: https://github.com/puppeteer/puppeteer

Usage

To start using Puppeteer Cluster, first install the dependencies, for example via npm.

Install puppeteer (if you don’t already have it installed):


npm install --save puppeteer

Install puppeteer-cluster:


npm install --save puppeteer-cluster

All you need to provide when using Puppeteer Cluster is:

  • The number of jobs that should run in parallel (and which concurrency model to use)

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2 // Could be 1,000,000, but let's start with 2
});

  • Define the task that performs the expected action

await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // Perform the action, for example store the result
});

  • Queue the jobs via queue (it also accepts richer data, as sketched after the snippet below) and wait until the cluster has processed them all

cluster.queue('http://www.google.com/');
cluster.queue('https://scrapingant.com/');
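
Besides bare URL strings, queue can also take an arbitrary data object and an optional per-job task function, and cluster.execute works the same way but returns a promise with the task's result. A hedged sketch (the selector and the title extraction are only illustrations):

// Queue a job with extra data instead of a bare URL
cluster.queue({ url: 'https://scrapingant.com/', selector: 'h1' });

// Or run a one-off job and get its result back
const title = await cluster.execute('https://scrapingant.com/', async ({ page, data: url }) => {
  await page.goto(url);
  return page.title();
});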

  • The whole script looks like the example below:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // Perform the action, for example store the result
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('https://scrapingant.com/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();
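
The error handling mentioned in the feature list is exposed as a taskerror event on the cluster. A minimal sketch of subscribing to it (the logging is only illustrative):

// Register after Cluster.launch(), before queueing jobs
cluster.on('taskerror', (err, data) => {
  // data is whatever was passed to cluster.queue(), e.g. the URL
  console.log(`Error crawling ${data}: ${err.message}`);
});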

And that is all.

For the extensive documentation, visit the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster

And the examples directory inside the repository: https://github.com/thomasdondorf/puppeteer-cluster/tree/master/examples

Conclusion

Apart from scraping, you can also use Puppeteer Cluster for automated testing, performance testing, improving your site's ranking, and many other cases that benefit from parallel browser workers.
