Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome. Almost everything you can do manually in a browser can be done with Puppeteer, so it quickly became one of the most popular web scraping tools in Node.js (with unofficial ports such as Pyppeteer for Python). Many developers use it for data extraction from single-page applications (SPAs), as it allows executing client-side JavaScript. In this article, we are going to show how to set up a proxy in Puppeteer and how to spin up your own rotating proxy server.
Configuring a proxy in Puppeteer
To request the target site via a proxy server, we just need to pass the --proxy-server launch argument with a proper proxy address, for example http://10.10.10.10:8080:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://10.10.10.10:8080']
  });
  const page = await browser.newPage();
  await page.goto('https://httpbin.org/ip');
  // Print the response body to verify which IP the target site sees
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();
As a result, httpbin should respond with a JSON document containing the proxy server's IP address, so the code above can also be used to test a proxy:
{
  "origin": "10.10.10.10"
}
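One thing to keep in mind: if your proxy requires authentication, Chrome ignores credentials embedded in the --proxy-server value. Puppeteer's page.authenticate method supplies them instead (a minimal sketch with placeholder credentials):

const page = await browser.newPage();
// Chrome does not read credentials from --proxy-server,
// so we pass them explicitly to answer the proxy's auth challenge
await page.authenticate({ username: 'proxy-user', password: 'proxy-password' });
await page.goto('https://httpbin.org/ip');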
Pretty simple, isn't it? The only downside of this approach is that the defined proxy server is used for every request for the browser's whole lifetime; to change the proxy, the browser has to be relaunched via puppeteer.launch with a new proxy IP address.
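For illustration, here is a minimal sketch of that relaunch-per-proxy pattern (fetchViaProxy is a hypothetical helper name, not a Puppeteer API):

const puppeteer = require('puppeteer');

// Hypothetical helper: launches a fresh browser for every proxy
async function fetchViaProxy(url, proxyUrl) {
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
  });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.evaluate(() => document.body.innerText);
  } finally {
    // A full browser restart is needed to switch to another proxy
    await browser.close();
  }
}

Every call pays the full browser startup cost, which is exactly why per-request rotation calls for a different approach.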
Rotating proxy servers on your own
To avoid bans while web scraping, you need to use different proxies and rotate them. If you implement your own custom IP pool, you'll need to re-launch headless Chrome each time with new proxy server settings. So how can we rotate the proxy for each browser request?
The answer is pretty simple: route each request through your own proxy rotation server! Such a server handles proxy rotation for the browser, saving you precious time while web scraping.
To spin up a rotating proxy server you can use the handy proxy-chain library and our free proxies list:
const ProxyChain = require('proxy-chain');

const proxies = {
  'session_1': 'http://185.126.200.167:3128',
  'session_2': 'http://116.228.227.211:443',
  'session_3': 'http://185.126.200.152:3128',
};

const server = new ProxyChain.Server({
  port: 8080,
  prepareRequestFunction: ({ request }) => {
    // At this point we decide which proxy from the list to use.
    // Requests can be pinned to one proxy via a 'session-id' header,
    // otherwise we fall back to a random proxy from the list.
    const sessionId = request.headers['session-id'];
    const pool = Object.values(proxies);
    const randomProxy = pool[Math.floor(Math.random() * pool.length)];
    return { upstreamProxyUrl: proxies[sessionId] || randomProxy };
  }
});

server.listen(() => console.log('Rotating proxy server started.'));
The proxies in the example above may already be outdated by the time you read this article. You can find fresh ones on our Free proxy page.
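To route Puppeteer traffic through this local rotating server, point --proxy-server at it. A sketch, assuming the server above is listening on 127.0.0.1:8080 (note that the session-id header is only visible to the proxy for plain-HTTP requests; HTTPS traffic is tunneled, so those requests get the random fallback):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://127.0.0.1:8080']
  });
  const page = await browser.newPage();
  // Pin this page's plain-HTTP requests to one upstream proxy;
  // tunneled HTTPS requests fall back to a random proxy from the pool
  await page.setExtraHTTPHeaders({ 'session-id': 'session_1' });
  await page.goto('http://httpbin.org/ip');
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();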
The only disadvantage of this method is that you have to maintain a bigger codebase and dive deep into networking, proxy management, and maintenance.
A one-API-call solution
To simplify your web scraper and get more breathing room while scraping at scale, you might want to get rid of the infrastructure pain and just focus on what you really want to achieve: extracting the data.
The ScrapingAnt web scraping API provides the ability to scrape the target page with only one API call. All proxy rotation and headless Chrome rendering are already handled on the API side. You can check out how simple it is with the ScrapingAnt JavaScript client:
const ScrapingAntClient = require('@scrapingant/scrapingant-client');

const client = new ScrapingAntClient({ apiKey: '<YOUR-SCRAPINGANT-API-KEY>' });

// Check the proxy rotation
client.scrape('https://httpbin.org/ip')
  .then(res => console.log(res))
  .catch(err => console.error(err.message));
Or with a plain JavaScript request to the API (a bit more boilerplate code):
const https = require("https");

const options = {
  method: "POST",
  hostname: "api.scrapingant.com",
  path: "/v1/general",
  headers: {
    "x-api-key": "<YOUR-SCRAPINGANT-API-KEY>",
    "content-type": "application/json",
    "accept": "application/json"
  }
};

const req = https.request(options, (res) => {
  const chunks = [];
  res.on("data", (chunk) => chunks.push(chunk));
  res.on("end", () => {
    const body = Buffer.concat(chunks);
    console.log(body.toString());
  });
});

// Send the target URL in the JSON request body
req.write(JSON.stringify({
  url: 'https://httpbin.org/ip',
}));
req.end();
With the ScrapingAnt web scraping API, you can forget about any complications with IP rotation, and its internal anti-scraping mechanisms will help you avoid detection by Cloudflare. You can use it for free: follow here to sign up and get your API token.