As you may know, Puppeteer is a high-level API for controlling headless Chrome, and it's probably one of the most popular web scraping tools on the Internet. The only problem is that an average web developer can be overwhelmed by the sheer number of settings involved in a proper web scraping setup.
I want to share 6 handy and fairly obvious tricks that should help web developers increase their web scraper's success rate, improve performance, and avoid bans.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
For more information, please visit the official website.
Since Puppeteer is rather complicated, a developer needs to learn many preferences and configuration options to scrape the web properly and reach a high success rate. I've prepared the top 6 obvious tips from web scraping veterans that most web scraper developers often forget.
Puppeteer can run in headless mode, which is also its default mode. This stops the browser from rendering on the screen and saves a lot of resources. It comes in very handy when running Puppeteer inside Docker, as it's impossible to use it in full mode there without `xvfb` (virtual framebuffer) or an alternative tool.
To start Puppeteer in headless mode, add `headless: true` to the launch options, or simply omit the option, since headless is the default.
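As a minimal sketch of the launch call described above: `launchHeadless` is a hypothetical helper name (not part of Puppeteer), and it assumes the `puppeteer` package is installed. The module is taken as an optional argument only so the helper can be exercised with a stub.

```javascript
// Sketch only: `launchHeadless` is an illustrative helper, not a Puppeteer API.
// By default it loads the installed `puppeteer` package (npm install puppeteer).
async function launchHeadless(puppeteer = require('puppeteer')) {
  // `headless: true` matches the default; it is spelled out here for clarity.
  // Pass `headless: false` instead to see the browser window while debugging.
  return puppeteer.launch({ headless: true });
}

// Usage (inside an async function):
//   const browser = await launchHeadless();
//   ...
//   await browser.close();
```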
The most common misunderstanding that affects web scraper performance is opening a new Chromium tab in Puppeteer right after the browser launches. The mistake is widespread because many Puppeteer tutorials and StackOverflow answers are just code samples, not production-grade solutions.
The line in question typically looks like `const page = await browser.newPage();`.
To verify this, run the following code right after the browser launches. It shows the number of open tabs:
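One way to run that check, as a sketch: `countOpenTabs` is a hypothetical helper name, and `browser` is assumed to be the object returned by `puppeteer.launch()`.

```javascript
// Sketch only: counts the tabs currently open in a Puppeteer browser.
async function countOpenTabs(browser) {
  const pages = await browser.pages(); // resolves to an array of open pages
  return pages.length;
}

// Usage (inside an async function, assuming `puppeteer` is installed):
//   const browser = await require('puppeteer').launch();
//   console.log(await countOpenTabs(browser));
```

If the count is already non-zero before any `browser.newPage()` call, every extra tab opened at startup is pure overhead.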
Puppeteer launches the browser with one tab already open. To access that page:
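A small sketch of reusing the existing tab instead of opening a new one. `getInitialPage` is a hypothetical helper name; the fallback to `browser.newPage()` is an added safety net, not something the text above requires.

```javascript
// Sketch only: returns the tab that the browser opened at launch.
async function getInitialPage(browser) {
  const [page] = await browser.pages(); // first already-open tab, if any
  // Fall back to a new tab only if the browser somehow started with none.
  return page ?? browser.newPage();
}

// Usage (inside an async function):
//   const page = await getInitialPage(browser);
//   await page.goto('https://example.com');
```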
It's important to use proxies while scraping at scale. When you scrape a website and visit more than a certain number of pages, its rate-limiting defense mechanism will block your visits. Some sites return a status code in the 4xx range when they recognize a scraping attempt, or return an empty page with a CAPTCHA check. A proxy lets you avoid IP bans and get around rate limits while accessing a target site. You can check out the extended version of the Puppeteer proxy setup article or follow the useful snippets below.
When launching Puppeteer, pass the proxy address in the `args` array as `--proxy-server=<address>`, which sends this parameter directly to the headless Chrome instance:
Setting up an unauthenticated proxy:
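A minimal sketch of that launch configuration. `launchWithProxy` is a hypothetical helper name and the address in the usage comment is a placeholder, not a working proxy.

```javascript
// Sketch only: launches Chrome with all traffic routed through a proxy.
// By default it loads the installed `puppeteer` package.
async function launchWithProxy(proxyAddress, puppeteer = require('puppeteer')) {
  return puppeteer.launch({
    // The flag is forwarded as-is to the Chrome process.
    args: [`--proxy-server=${proxyAddress}`],
  });
}

// Usage (inside an async function; the address is a placeholder):
//   const browser = await launchWithProxy('http://10.10.10.10:8000');
```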
For a proxy with a username and password, pass the credentials on the page object itself, using the `page.authenticate()` method:
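A sketch of the authenticated case, assuming the browser was already launched with a `--proxy-server` argument. `openPageBehindProxy` is a hypothetical helper name and the credentials in the usage comment are placeholders.

```javascript
// Sketch only: supplies proxy credentials on the page before navigating.
async function openPageBehindProxy(browser, credentials) {
  const [page] = await browser.pages(); // reuse the tab opened at launch
  // page.authenticate() provides credentials for HTTP authentication,
  // which is how the proxy's login challenge gets answered.
  await page.authenticate({
    username: credentials.username,
    password: credentials.password,
  });
  return page;
}

// Usage (inside an async function; credentials are placeholders):
//   const page = await openPageBehindProxy(browser, { username: 'bob', password: 'secret' });
//   await page.goto('https://example.com');
```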
You can try our free proxies to check out these code snippets.
The HTTP protocol is stateless, but cookies and the WebStorage API allow it to keep context consistent over the session flow. It's very important to be able to store and re-use session data while scraping a site that requires authentication or authorization. Puppeteer's API is very helpful for controlling the cookie flow.
The following code snippet simulates a real cookie flow and saves the cookies to a file:
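As a sketch of the saving half of that flow: `saveCookies` is a hypothetical helper name and the file path in the usage comment is arbitrary. It uses the page-level `page.cookies()` method; recent Puppeteer releases may mark it as deprecated in favor of browser-level cookie methods, so check the documentation for your version.

```javascript
const fs = require('fs/promises');

// Sketch only: writes the cookies visible to the current page into a JSON file.
async function saveCookies(page, filePath) {
  const cookies = await page.cookies(); // cookies for the page's current URL
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2));
  return cookies.length; // number of cookies saved
}

// Usage (inside an async function, after logging in on `page`):
//   await saveCookies(page, './cookies.json');
```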
We are now able to read the file later and load the cookies into our new browser session:
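A matching sketch for the loading half, assuming the file holds a JSON array of cookie objects as returned by Puppeteer. `loadCookies` is a hypothetical helper name.

```javascript
const fs = require('fs/promises');

// Sketch only: reads cookies from a JSON file and sets them on the page.
async function loadCookies(page, filePath) {
  const cookies = JSON.parse(await fs.readFile(filePath, 'utf8'));
  // setCookie takes the cookies as separate arguments, hence the spread.
  await page.setCookie(...cookies);
  return cookies.length; // number of cookies loaded
}

// Usage (inside an async function, before navigating to the target site):
//   await loadCookies(page, './cookies.json');
//   await page.goto('https://example.com/account');
```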
Cookies come with an expiration date, so make sure the ones you are trying to use have not expired yet.
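One possible way to enforce this is to filter stored cookies before loading them. The sketch below assumes Puppeteer's cookie shape, where `expires` is a Unix timestamp in seconds and `-1` marks a session cookie; `dropExpiredCookies` is a hypothetical helper name.

```javascript
// Sketch only: keeps session cookies and cookies that expire in the future.
function dropExpiredCookies(cookies, nowSeconds = Date.now() / 1000) {
  return cookies.filter((cookie) => {
    // Assumption: a missing or -1 `expires` means a session cookie; keep it.
    if (cookie.expires === undefined || cookie.expires === -1) return true;
    return cookie.expires > nowSeconds;
  });
}

// Usage:
//   const fresh = dropExpiredCookies(savedCookies);
//   await page.setCookie(...fresh);
```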
To store the local storage data:
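A sketch of dumping local storage to disk. `saveLocalStorage` is a hypothetical helper name and the file path in the usage comment is arbitrary; the page is assumed to be already on the target site, since local storage is scoped per origin.

```javascript
const fs = require('fs/promises');

// Sketch only: copies the page's localStorage entries into a JSON file.
async function saveLocalStorage(page, filePath) {
  // The callback runs inside the page, where `localStorage` is available.
  const data = await page.evaluate(() => {
    const entries = {};
    for (let i = 0; i < localStorage.length; i++) {
      const key = localStorage.key(i);
      entries[key] = localStorage.getItem(key);
    }
    return entries;
  });
  await fs.writeFile(filePath, JSON.stringify(data, null, 2));
  return data;
}

// Usage (inside an async function, after the page has loaded):
//   await saveLocalStorage(page, './localStorage.json');
```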
To read the data back and pass it into the page context:
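A sketch of the restore step, assuming the file holds a flat JSON object of key/value strings. `restoreLocalStorage` is a hypothetical helper name. The page should already be on the target origin before this runs, because local storage is not shared between origins.

```javascript
const fs = require('fs/promises');

// Sketch only: reads saved entries and writes them into the page's localStorage.
async function restoreLocalStorage(page, filePath) {
  const data = JSON.parse(await fs.readFile(filePath, 'utf8'));
  // The second argument of evaluate is serialized and handed to the callback.
  await page.evaluate((entries) => {
    for (const [key, value] of Object.entries(entries)) {
      localStorage.setItem(key, value);
    }
  }, data);
  return Object.keys(data).length; // number of entries restored
}

// Usage (inside an async function):
//   await page.goto('https://example.com');
//   await restoreLocalStorage(page, './localStorage.json');
```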
It can be hard to hide all evidence of headless Chrome usage while scraping: the developer has to set the screen resolution properly, configure the user agent to avoid fingerprinting, and make every setting look like a real browser. puppeteer-extra-plugin-stealth handles all of these complications for you with just a few lines of code:
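A minimal sketch of wiring the plugin in, assuming the `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages are installed. `launchStealthBrowser` is a hypothetical helper name; the two modules are optional arguments only so the helper can be exercised with stubs.

```javascript
// Sketch only: launches a browser through puppeteer-extra with the stealth plugin.
async function launchStealthBrowser(
  puppeteer = require('puppeteer-extra'),
  StealthPlugin = require('puppeteer-extra-plugin-stealth')
) {
  puppeteer.use(StealthPlugin()); // register the plugin before launching
  return puppeteer.launch({ headless: true });
}

// Usage (inside an async function):
//   const browser = await launchStealthBrowser();
//   const [page] = await browser.pages();
//   await page.goto('https://example.com');
```

Call the helper once per process; calling it repeatedly would register the plugin again on the same `puppeteer-extra` instance.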
It's important to use a reliable solution when scraping the web at scale, so ScrapingAnt has created a simple API that takes care of rotating proxies, detection avoidance, and headless Chrome for you.
With the ScrapingAnt API, you can forget about the complications of IP rotation, and its internal anti-detection mechanisms will help you avoid being detected by Cloudflare. You can use it for free: follow here to sign in and get your API token.