
Scraping with millions of browsers or Puppeteer Cluster

Oleg Kulyk · 3 min read


In this article, we’d like to introduce an awesome open-source web scraping solution for running a pool of Chromium instances using Puppeteer.

Running a pool of Chromium instances using Puppeteer

Running web scraping tasks sequentially is not a good idea: each process has to wait for the previous ones to complete, and with many processes waiting in a queue the wasted time adds up quickly. To overcome this, we perform these actions in parallel, so the processes execute concurrently and the whole job takes much less time. A minimal comparison with plain Puppeteer is sketched below.
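To make the difference concrete, here is a minimal sketch using plain Puppeteer (before we bring in the cluster); the urls list and the title logging are just placeholder examples:

const puppeteer = require('puppeteer');

// Placeholder URLs for the example
const urls = ['https://example.com/', 'https://example.org/'];

(async () => {
  const browser = await puppeteer.launch();

  // Sequential: each page has to wait for the previous one to finish
  for (const url of urls) {
    const page = await browser.newPage();
    await page.goto(url);
    console.log(await page.title());
    await page.close();
  }

  // Parallel: all pages are processed concurrently
  await Promise.all(
    urls.map(async (url) => {
      const page = await browser.newPage();
      await page.goto(url);
      console.log(await page.title());
      await page.close();
    })
  );

  await browser.close();
})();

Puppeteer Cluster takes this idea further by managing the pool of workers, error handling, and retries for you.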

What does this library do?

  • Handling of crawling errors
  • Auto restarts the browser in case of a crash
  • Can automatically retry if a job fails
  • Different concurrency models to choose from (pages, contexts, browsers)
  • Simple to use, small boilerplate
  • Progress view and monitoring statistics (see the launch-options sketch below)

[Image: Cluster example]
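As a quick, hedged illustration of the retry and monitoring features from the list above, here is what a Cluster.launch call with those options enabled might look like; the values are example choices, not recommendations:

const { Cluster } = require('puppeteer-cluster');

// Inside an async function:
const cluster = await Cluster.launch({
  // Concurrency model: CONCURRENCY_PAGE, CONCURRENCY_CONTEXT or CONCURRENCY_BROWSER
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 4,   // number of parallel workers (example value)
  retryLimit: 2,       // retry a failed job up to 2 times
  retryDelay: 1000,    // wait one second between retries
  monitor: true,       // print the progress and statistics view
});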

To read more about Puppeteer itself, visit the official GitHub page: https://github.com/puppeteer/puppeteer

Installing Puppeteer Cluster

To start using Puppeteer Cluster, first install the dependencies, for example via npm.

Install puppeteer (if you don't already have it installed):

npm install --save puppeteer

Then install puppeteer-cluster:

npm install --save puppeteer-cluster

Usage

All you need to provide when using Puppeteer Cluster is:

  • The number of processes that should execute in parallel
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2, // can be 1,000,000, but let's start with 2
});
  • A task that performs the expected action
await cluster.task(async ({ page, data: url }) => {
  await page.goto(url);
  // perform the action, for example, store the result
});
  • The URLs to queue; the cluster invokes the task for each queued item until the queue is empty (queued items can also be data objects, as sketched after this list)
cluster.queue('http://www.google.com/');
cluster.queue('https://scrapingant.com/');
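The queue is not limited to plain URL strings: puppeteer-cluster passes any queued value to the task as data. Here is a hedged sketch, where the label field is just an assumed example:

await cluster.task(async ({ page, data }) => {
  await page.goto(data.url);
  // extra fields travel along with the job
  console.log(data.label, await page.title());
});

cluster.queue({ url: 'https://scrapingant.com/', label: 'home' });
cluster.queue({ url: 'https://scrapingant.com/blog/', label: 'blog' });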

Putting it all together, the whole script looks like this:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // perform the action, for example, store the result
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('https://scrapingant.com/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

And that is all.
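If you want to handle crawling errors yourself (the first feature in the list above), the cluster emits a taskerror event for jobs that fail; with retryLimit set, it fires once the retries are exhausted. A minimal sketch:

// Log jobs that ultimately failed
cluster.on('taskerror', (err, data) => {
  console.log(`Error crawling ${data}: ${err.message}`);
});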

Documentation

For the extensive documentation, visit the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster

And the examples directory inside the repository: https://github.com/thomasdondorf/puppeteer-cluster/tree/master/examples

Conclusion

Beyond scraping, you can also use Puppeteer Cluster for automated testing, performance testing, improving your site rank, and many other cases that benefit from parallel browser workers.

Of course, you can also try our Web Scraping API, which supports parallel execution of headless Chrome rendering behind a simple API.
