Scraping a website with a browser that real users actually run, rather than a bundled build, reduces the risk that the scraper will be banned because of its fingerprint or behavioral pattern. Playwright supports Chromium, Firefox, and WebKit out of the box and downloads them for you, so no manual installation is needed. However, most users run Google Chrome or Microsoft Edge rather than the open-source Chromium variant, so in some scenarios it's safer to use those browsers to emulate a more realistic browser environment.
What browsers can I use?
Any browser based on Chromium can be used with this technique. Playwright interacts with such browsers over the Chrome DevTools Protocol to open new tabs, click on elements, or execute JavaScript. Because of this core requirement, we have to use a recent version (a daily build such as Canary) so that the API schemas Playwright needs exist and match. To use these browsers, we only have to adjust the executable path option that Playwright uses to launch them.
On macOS, the browsers are installed in the /Applications directory, with the binaries inside each application bundle. On Linux, they are commonly installed in the /usr/bin directory; you'll find some examples below. On Windows, they are typically installed in the C:\Program Files (x86)\ directory.
Here are some example locations of Canary and Nightly builds on macOS, Windows, and Linux:
/Applications/Microsoft Edge Canary.app/Contents/MacOS/Microsoft Edge Canary - Microsoft Edge Canary on macOS
/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary - Google Chrome Canary on macOS
/usr/bin/google-chrome-unstable - Google Chrome Canary on Ubuntu
C:\Users\<username>\AppData\Local\Google\Chrome SxS\Application\chrome.exe - Google Chrome Canary on Windows
/Applications/Brave Browser Nightly.app/Contents/MacOS/Brave Browser Nightly - Brave Nightly on macOS
To find the exact executable path for a browser, open its version page inside that browser and look for the Executable Path entry: edge://version for Microsoft Edge, chrome://version for Google Chrome, and brave://version for Brave.
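If you are not sure which of these browsers is installed on a given machine, a small lookup can pick the first path that exists. The following is only a minimal sketch: the helper name findBrowserExecutable and the CANDIDATES list are made up for this illustration, and the paths are the example locations from this article, so adjust them to your system.

```javascript
const fs = require("fs");

// Example locations from this article; they are assumptions about your machine.
const CANDIDATES = [
  "/Applications/Microsoft Edge Canary.app/Contents/MacOS/Microsoft Edge Canary",
  "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
  "/usr/bin/google-chrome-unstable",
  "/Applications/Brave Browser Nightly.app/Contents/MacOS/Brave Browser Nightly",
];

// Hypothetical helper: returns the first candidate path that exists, or null.
// `exists` is injectable so the lookup can be tried without real browsers installed.
function findBrowserExecutable(candidates = CANDIDATES, exists = fs.existsSync) {
  const found = candidates.find((candidate) => exists(candidate));
  return found === undefined ? null : found;
}

// Prints the detected path, or null when none of the candidates is present.
console.log(findBrowserExecutable());
```

The returned value can then be passed as the executable path when launching the browser. A null result means none of the listed locations exists, and the path should be looked up on the browser's version page instead.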
Connecting Chrome, Microsoft Edge, Brave, and any other Chromium-based browser with Playwright
To launch the selected browser from code, you just need to pass the executablePath option to the launch function:
const playwright = require("playwright-core");

(async () => {
  // Launch the locally installed browser instead of a bundled Chromium build.
  const browser = await playwright.chromium.launch({
    headless: false,
    // Spaces in the path need no escaping inside a JavaScript string.
    executablePath: "/Applications/Microsoft Edge Canary.app/Contents/MacOS/Microsoft Edge Canary",
  });
  const page = await browser.newPage();
  await page.goto("https://scrapingant.com/");
  console.log(await page.content()); // print the rendered HTML
  await page.screenshot({ path: "screenshot.png" });
  await browser.close();
})();
Because the browser launches with the headless: false option, you'll be able to watch it start. Also, we're using the playwright-core package, which installs only the library and skips downloading the bundled browsers, which we don't need in this case.
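Since the executable path differs between machines and operating systems, it can be convenient to keep it out of the source code. The sketch below is one possible approach, not the only one: the environment variable names BROWSER_EXECUTABLE_PATH and HEADLESS, and the helpers buildLaunchOptions and launchCustomBrowser, are hypothetical names chosen for this example. Only chromium.launch with the headless and executablePath options comes from Playwright itself.

```javascript
// Hypothetical helper: builds Playwright launch options from environment variables.
// BROWSER_EXECUTABLE_PATH and HEADLESS are names invented for this sketch.
function buildLaunchOptions(env = process.env) {
  const executablePath = env.BROWSER_EXECUTABLE_PATH;
  if (!executablePath) {
    throw new Error("Set BROWSER_EXECUTABLE_PATH to the browser binary");
  }
  return {
    // Headless only when explicitly requested, so the browser window is visible by default.
    headless: env.HEADLESS === "1",
    executablePath,
  };
}

// Hypothetical helper: launches the browser described by the environment.
async function launchCustomBrowser(env = process.env) {
  // Required lazily so buildLaunchOptions works even without playwright-core installed.
  const { chromium } = require("playwright-core");
  return chromium.launch(buildLaunchOptions(env));
}
```

With this in place, the same script can run against different browsers, for example by setting BROWSER_EXECUTABLE_PATH=/usr/bin/google-chrome-unstable before starting Node.js and calling launchCustomBrowser() from the scraper.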
Where to get the browsers?
Google Chrome
Only the Canary builds are eligible for use with Playwright. To get one, just visit the official website.
Microsoft Edge
To use Microsoft Edge with Playwright, we need a recent Canary build too. Since October 2020, you can use it on Linux as well. The browser can be downloaded from the official website.
Brave
Brave does not follow the official Chromium release schedule, so its latest versions do not match Chromium's. There is no guarantee that all Playwright functionality will work out of the box. If you still want to try it out, you can get the Nightly version from the official website.
Summary
In this article, we've looked at a fairly easy way of connecting Chromium-based browsers with Playwright. When it comes to avoiding blocks, running a real browser build makes the browser fingerprint more realistic than common techniques such as the stealth plugin can on their own. For advanced usage and documentation of Playwright features, please see the official website, playwright.dev.
Happy web scraping, and don't forget to pass the cookies during data extraction!