6 Puppeteer Tricks to Avoid Detection and Make Web Scraping Easier
Oleg Kulyk
Co-Founder @ ScrapingAnt

As you know, Puppeteer is a high-level API to control headless Chrome, and it's probably one of the most popular web scraping tools on the Internet. The only problem is that an average web developer might be overwhelmed by the tons of possible settings a proper web scraping setup requires.
I want to share 6 handy and pretty obvious tricks that should help web developers increase their web scraper's success rate, improve performance, and avoid bans.
# What is Puppeteer?

> Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
The quote above means that Puppeteer allows automating your data extraction tasks and simulating real user behavior to avoid bans while web scraping. Also, Chromium will render JavaScript, which is helpful for scraping single-page applications (SPA).
For more information, please visit the official website.
# 6 Puppeteer tricks for Web Scraping

Since Puppeteer is rather complicated, there are many preferences and configurations a developer needs to learn to scrape the web properly and reach a high success rate. I've prepared the top 6 tips, obvious to web scraping veterans, that regular web scraper developers often forget.
# Headless mode

Puppeteer allows the user to run it in headless mode, which is Puppeteer's default. This stops the browser from rendering on the screen and saves a lot of resources. It comes in very handy when running Puppeteer inside Docker, as it's impossible to use it in full mode without xvfb (virtual framebuffer) or an alternative tool.

To start Puppeteer in headless mode, we will need to add headless: true to the launch arguments, or omit this line to launch it in headless mode by default.
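Here's a minimal sketch of such a launch (the option is spelled out explicitly, even though it's the default):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // headless: true is the default, so this option can be omitted
  const browser = await puppeteer.launch({ headless: true });
  const [page] = await browser.pages();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();
```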
# Avoid creating a new tab

The most common misunderstanding that affects web scraper performance is opening a new Chromium tab in Puppeteer right after browser launch. This common mistake results from many Puppeteer tutorials and StackOverflow answers being just code samples, not production-grade solutions.
This line looks like the following one:
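```javascript
const page = await browser.newPage(); // opens a second, unnecessary tab
```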
To check this trick just run the following code after the browser launch. It shows the opened tabs count:
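```javascript
// browser.pages() resolves to an array of all open tabs
console.log((await browser.pages()).length); // prints 2 if newPage() was called after launch
```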
When launching a browser on Puppeteer, it launches with an open tab. To access the already opened page:
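```javascript
// reuse the tab that Puppeteer opened at launch instead of creating a new one
const [page] = await browser.pages();
```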
# Proxy setup

It's important to use proxies while scraping at scale. When you try to scrape a website and visit over a certain number of pages, the rate-limiting defense mechanism will block your visits. Some sites will return status codes in the 4xx range when they recognize a scraping attempt, or return an empty page with a Captcha check. A proxy allows you to avoid IP bans and get around rate limits while accessing a target site. You can check out the extended version of the Puppeteer proxy setup article or follow the useful snippets below.

When launching Puppeteer, you will need to pass the proxy address in the args array as --proxy-server=<address>, which sends this parameter to the headless Chrome instance directly.
Setting up an unauthenticated proxy:
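A minimal sketch (the proxy address below is a placeholder, swap in your own endpoint):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // 10.10.10.10:8080 is a placeholder, use your real proxy address
    args: ['--proxy-server=http://10.10.10.10:8080'],
  });
  const [page] = await browser.pages();
  await page.goto('https://httpbin.org/ip'); // shows the egress IP
  await browser.close();
})();
```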
For a proxy with a username/password, you should pass the credentials on the page object itself using the page.authenticate() method:
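A sketch with placeholder credentials:

```javascript
const [page] = await browser.pages();
// proxyUser/proxyPassword are placeholders for your real credentials
await page.authenticate({ username: 'proxyUser', password: 'proxyPassword' });
await page.goto('https://httpbin.org/ip');
```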
tip
You can try our free proxies to check out these code snippets.
# Setting up cookies and local storage data

The HTTP protocol is stateless, but cookies and the WebStorage API allow it to keep context consistent over the session flow. It's very important to be able to store and re-use session data while scraping a site that requires authentication or authorization. Puppeteer's API becomes very helpful for dealing with cookies flow control:
The following code snippet simulates a real cookies flow with the help of HTTPBin:
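A sketch of the save step (cookies.json is just an arbitrary file name for this example):

```javascript
const fs = require('fs').promises;
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  // httpbin sets the requested cookie, so the session now carries state
  await page.goto('https://httpbin.org/cookies/set?session=example');
  // persist the session's cookies for later re-use
  const cookies = await page.cookies();
  await fs.writeFile('cookies.json', JSON.stringify(cookies, null, 2));
  await browser.close();
})();
```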
We are now able to read the file later and load the cookies into our new browser session:
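```javascript
// assumes a freshly launched browser and page, as in the snippets above
const cookies = JSON.parse(await fs.readFile('cookies.json'));
// page.setCookie accepts one or more cookie objects
await page.setCookie(...cookies);
```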
tip
Cookies come with an expiration date, so make sure the ones you are trying to use have not expired yet.
To access local storage, you need to evaluate custom JavaScript code inside the page's context:
To store the local storage data:
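```javascript
// serialize the page's localStorage from inside the page context
const localStorageData = await page.evaluate(() =>
  JSON.stringify(window.localStorage)
);
await fs.writeFile('localStorage.json', localStorageData);
```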
To read and pass inside the page context back:
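```javascript
const localStorageData = JSON.parse(await fs.readFile('localStorage.json'));
// write each saved key back into the page's localStorage
await page.evaluate((data) => {
  for (const [key, value] of Object.entries(data)) {
    window.localStorage.setItem(key, value);
  }
}, localStorageData);
```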
# Scrape like a ninja - use the stealth plugin

It might be hard to hide all the shreds of evidence of headless Chrome usage while scraping: the screen resolution should be set properly, a user agent should be configured to avoid fingerprinting, and all the settings should look like a real browser's. puppeteer-extra-plugin-stealth handles all these complications for you with just a few lines of code:
Setup:
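```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```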
Usage:
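puppeteer-extra is a drop-in replacement for the puppeteer module; the check page used below is just one convenient way to inspect common headless leaks:

```javascript
// puppeteer-extra wraps puppeteer and applies registered plugins
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const [page] = await browser.pages();
  // bot.sannysoft.com reports common headless fingerprints
  await page.goto('https://bot.sannysoft.com');
  await page.screenshot({ path: 'stealth-check.png', fullPage: true });
  await browser.close();
})();
```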
# Using cloud API

It's pretty important to use a reliable solution while web scraping at scale, so ScrapingAnt has created a simple API which will take care of rotating proxies, detection avoidance, and headless Chrome for you.
You can check out how simple it is with the ScrapingAnt Javascript client:
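A minimal sketch using the @scrapingant/scrapingant-client npm package (check the client's docs for the exact method names and response shape):

```javascript
const ScrapingAntClient = require('@scrapingant/scrapingant-client');

// <YOUR_API_TOKEN> is a placeholder for your real token
const client = new ScrapingAntClient({ apiKey: '<YOUR_API_TOKEN>' });

client.scrape('https://example.com')
  .then((res) => console.log(res.content)) // response field name assumed
  .catch((err) => console.error(err.message));
```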
Or with a plain Javascript request to API (a bit more boilerplate code):
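A sketch with node-fetch, assuming the API's general endpoint accepts the target URL as a query parameter and the token in an x-api-key header (verify the exact contract against the official docs):

```javascript
const fetch = require('node-fetch');

(async () => {
  const target = encodeURIComponent('https://example.com');
  // endpoint path and parameter names assumed, verify against the API docs
  const response = await fetch(
    `https://api.scrapingant.com/v2/general?url=${target}`,
    { headers: { 'x-api-key': '<YOUR_API_TOKEN>' } }
  );
  console.log(await response.text());
})();
```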
With ScrapingAnt Web Scraping API, you can forget about any complications with IP rotation, and the internal anti-scraping avoidance mechanisms will help you to not be detected by Cloudflare. You can use it for free: follow here to sign in and get your API token.