In this article, we’d like to introduce Puppeteer Cluster, an open-source library for running a pool of Chromium instances with Puppeteer, which makes it a good fit for web scraping.
Running web scraping tasks sequentially is slow, because each task has to wait for the previous one to finish. With many tasks waiting in a queue, the total time adds up quickly. To overcome this, we are going to run the tasks in parallel, so that several pages are processed at the same time and the whole run takes less time.
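To make the difference concrete, here is a minimal sketch in plain Node.js, without Puppeteer. The names `fakeJob`, `runPool`, and `compare` are hypothetical and only illustrate the idea of a bounded pool of workers; this is not how Puppeteer Cluster is implemented internally.

```javascript
// Hypothetical stand-in for a scraping job: resolves with `id` after `ms` milliseconds.
const fakeJob = (id, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(id), ms));

// Runs `jobs` (functions that return promises) with at most `limit` in flight.
async function runPool(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;

  // Each worker keeps claiming the next unclaimed job until none remain.
  async function worker() {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await jobs[index]();
    }
  }

  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers);
  return results; // results keep the order of `jobs`
}

// Times the same four jobs with one worker and then with four workers.
async function compare() {
  const jobs = [1, 2, 3, 4].map((id) => () => fakeJob(id, 100));

  let start = Date.now();
  await runPool(jobs, 1); // sequential: one job at a time
  const sequentialMs = Date.now() - start;

  start = Date.now();
  await runPool(jobs, 4); // parallel: all four jobs at once
  const parallelMs = Date.now() - start;

  return { sequentialMs, parallelMs };
}

compare().then(({ sequentialMs, parallelMs }) => {
  console.log(`sequential: ${sequentialMs} ms, parallel: ${parallelMs} ms`);
});
```

With four jobs of roughly 100 ms each, the sequential run should take about four times as long as the parallel one. Real scraping jobs are limited by network and CPU, so the gain in practice is usually smaller.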
What does this library do?
- Handling of crawling errors
- Auto restarts the browser in case of a crash
- Can automatically retry if a job fails
- Different concurrency models to choose from (pages, contexts, browsers)
- Simple to use, small boilerplate
- Progress view and monitoring statistics
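Several of these features correspond to options passed to `Cluster.launch` and to the `taskerror` event. The sketch below is an illustration under assumptions, not a recommended configuration: the values chosen for `retryLimit`, `retryDelay`, and `timeout` are arbitrary, the helper name `crawlWithRetries` is hypothetical, and `Cluster` is passed in as an argument so the wiring can be tried out without launching a browser.

```javascript
// Hypothetical helper: crawls `urls` and returns the page titles it collected.
// `Cluster` is the class exported by puppeteer-cluster.
async function crawlWithRetries(Cluster, urls) {
  const cluster = await Cluster.launch({
    // One of CONCURRENCY_PAGE, CONCURRENCY_CONTEXT, CONCURRENCY_BROWSER
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    retryLimit: 2,    // retry a failed job up to 2 times (arbitrary value)
    retryDelay: 1000, // wait 1 second before a retry (arbitrary value)
    timeout: 30000,   // give up on a single job after 30 seconds
    monitor: false,   // set to true to print progress statistics in the terminal
  });

  // Errors thrown inside the task are reported here instead of stopping the script.
  cluster.on('taskerror', (err, data, willRetry) => {
    const action = willRetry ? 'retrying' : 'giving up';
    console.warn(`Failed to crawl ${data}: ${err.message} (${action})`);
  });

  const titles = [];
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    titles.push(await page.title());
  });

  urls.forEach((url) => cluster.queue(url));

  await cluster.idle();
  await cluster.close();
  return titles;
}

// Usage, assuming puppeteer and puppeteer-cluster are installed:
// const { Cluster } = require('puppeteer-cluster');
// crawlWithRetries(Cluster, ['https://scrapingant.com/']).then(console.log);
```

Because the jobs run concurrently, the order of `titles` is not guaranteed to match the order of `urls`.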
To read more about Puppeteer itself, visit the official GitHub page: https://github.com/puppeteer/puppeteer
Usage
To start using Puppeteer Cluster, first install the dependencies, for example via npm.
Install puppeteer (if you don’t already have it installed):
npm install --save puppeteer
Install puppeteer-cluster:
npm install --save puppeteer-cluster
All that you need to provide when using Puppeteer Cluster is:
- The maximum number of workers that run in parallel
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2, // Can be much higher, but let's start with 2
});
- A task that performs the required action on each page
await cluster.task(async ({ page, data: url }) => {
  await page.goto(url);
  // Perform the action, for example, store the result
});
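- The URLs to process, added with queue, followed by a wait until the cluster has finished
cluster.queue('http://www.google.com/');
cluster.queue('https://scrapingant.com/');
await cluster.idle(); // resolves once the queue is empty
await cluster.close(); // shuts down the browser workers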
The whole script looks like this:
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // Perform the action, for example, store the result
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('https://scrapingant.com/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();
And that is all.
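If the result of a task is needed back in the calling code, `cluster.execute` can be used instead of `cluster.queue`: it returns a promise that resolves with whatever the task function returns. The following is a minimal sketch under assumptions: the helper name `fetchTitles` and the shape of the returned objects are hypothetical, and `Cluster` is again passed in as an argument. Jobs started with `execute` reject on failure, so each one is handled separately here.

```javascript
// Hypothetical helper: resolves to one { url, title } or { url, error } per URL,
// in the same order as `urls`. `Cluster` is the class exported by puppeteer-cluster.
async function fetchTitles(Cluster, urls) {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  // Whatever the task returns becomes the resolved value of cluster.execute().
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return { url, title: await page.title() };
  });

  try {
    return await Promise.all(
      urls.map((url) =>
        // execute() rejects if the task throws, so record the error per URL.
        cluster.execute(url).catch((err) => ({ url, error: err.message }))
      )
    );
  } finally {
    // Always shut the workers down, even if something unexpected fails.
    await cluster.idle();
    await cluster.close();
  }
}

// Usage, assuming puppeteer and puppeteer-cluster are installed:
// const { Cluster } = require('puppeteer-cluster');
// fetchTitles(Cluster, ['https://scrapingant.com/']).then(console.log);
```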
For the full documentation, visit the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster
The examples directory inside the repository is also worth a look: https://github.com/thomasdondorf/puppeteer-cluster/tree/master/examples
Conclusion
Apart from scraping, you can also use Puppeteer Cluster for automation testing, performance testing, improving your site rank, and many other cases that benefit from parallel browser workers.