In this article, I’d like to share a quick guide on how to run Playwright inside AWS Lambda. There are plenty of similar guides for Puppeteer, but only a few cover its successor from Microsoft.
Playwright 101
Before getting into the AWS Lambda part, let’s briefly go over what we are trying to achieve with Playwright. To get the content of a given URL with Playwright, we have to go through four steps:
- Launch a new browser
- Open a new page
- Navigate to the given URL
- Get the page content
Here’s what that looks like:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://scrapingant.com/');
  const result = await page.content();
  console.log(result);
  await browser.close();
})();
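The four steps can also be wrapped in a reusable function. The following is a minimal sketch rather than part of the original example: `fetchPageContent` is a hypothetical helper name, and the browser type (for example `chromium` from `playwright`) is passed in as an argument so the function can be tried with a stand-in object. The `try`/`finally` makes sure the browser is closed even when navigation fails.

```javascript
// Hypothetical helper (name and signature are illustrative, not a Playwright API).
// `browserType` is expected to behave like Playwright's `chromium` object.
async function fetchPageContent(browserType, url) {
  const browser = await browserType.launch(); // 1. launch a new browser
  try {
    const context = await browser.newContext();
    const page = await context.newPage(); // 2. open a new page
    await page.goto(url); // 3. navigate to the given URL
    return await page.content(); // 4. get the page content
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Usage with the real library (assumes `playwright` is installed):
//   const { chromium } = require('playwright');
//   fetchPageContent(chromium, 'https://scrapingant.com/').then(console.log);
```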
AWS Lambda 101
AWS Lambda functions are Node.js (in our case) functions that can be called from a frontend website or any other code via an HTTP or SDK request. They give us the power of a backend server without having to create and maintain a full-blown API.
AWS Lambda beginner's guide: https://aws.amazon.com/lambda/getting-started/
An extensive guide to AWS Lambda with SAM: https://itnext.io/creating-aws-lambda-applications-with-sam-dd13258c16dd
A single AWS Lambda function looks like this:
exports.handler = async (event, context) => {
  // do your stuff here
}
For example, if we’d like to implement an AWS Lambda scraper:
exports.handler = async (event, context) => {
  // The request body arrives as a JSON string
  const params = JSON.parse(event.body);
  const pageToScrape = params.pageToScrape;
  // the actual scraping of pageToScrape goes here
}
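Since `JSON.parse` throws on a missing or malformed body, it can help to validate the request before launching a browser. Below is a minimal sketch under a few assumptions: `parseScrapeRequest` is a hypothetical helper, the event is assumed to carry a JSON string in `event.body`, and only `http`/`https` URLs are accepted, which is a design choice of this sketch and not something the original requires.

```javascript
// Hypothetical helper: extracts and validates `pageToScrape` from the event.
// Returns { ok: true, pageToScrape } or { ok: false, error }.
function parseScrapeRequest(event) {
  let params;
  try {
    // JSON.parse('') throws, so a missing body is reported as invalid JSON
    params = JSON.parse((event && event.body) || '');
  } catch (err) {
    return { ok: false, error: 'Request body must be valid JSON' };
  }
  if (params === null || typeof params !== 'object' ||
      typeof params.pageToScrape !== 'string') {
    return { ok: false, error: 'pageToScrape must be a string' };
  }
  let url;
  try {
    url = new URL(params.pageToScrape); // throws on malformed URLs
  } catch (err) {
    return { ok: false, error: 'pageToScrape must be a valid URL' };
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { ok: false, error: 'Only http and https URLs are supported' };
  }
  return { ok: true, pageToScrape: url.href };
}
```

The handler can then return early with an error when `ok` is `false`, instead of paying for a browser launch that is bound to fail.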
Putting Playwright and AWS Lambda together
You may wonder why we can’t just use the Playwright library as is. The problem is the Chromium binary bundled with it: Lambda needs a binary compiled specifically for its environment, and Microsoft does not provide one by default.
So where do we find the right binary?
In chrome-aws-lambda, a Chromium binary for AWS Lambda and Google Cloud Functions.
There are two ways to connect it to Playwright:
1. Lazy and simple headless Chrome running
The following library and NPM package supports both Playwright and the Chromium binaries from chrome-aws-lambda: https://github.com/JupiterOne/playwright-aws-lambda
It can be installed via npm:
npm install playwright-core playwright-aws-lambda --save
The final code looks like this:
const playwright = require('playwright-aws-lambda');

exports.handler = async (event, context) => {
  const params = JSON.parse(event.body);
  const pageToScrape = params.pageToScrape;
  let result = null;
  let browser = null;
  try {
    // Assign to the outer variables (no `const` here), otherwise the
    // `finally` block would never see the launched browser
    browser = await playwright.launchChromium();
    const browserContext = await browser.newContext();
    const page = await browserContext.newPage();
    await page.goto(pageToScrape);
    result = await page.content();
    console.log(result);
  } finally {
    // Always close the browser, even if scraping failed
    if (browser !== null) {
      await browser.close();
    }
  }
  return result;
};
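If the function is exposed through API Gateway with Lambda proxy integration (an assumption, since the article does not say how the function is invoked), the handler is expected to return an object with a `statusCode` and a string `body`. Here is a minimal sketch of that shape: `createHandler` and `scrape` are hypothetical names, and `scrape` stands for any async function that takes a URL and resolves to HTML, such as a wrapper around the Playwright code above.

```javascript
// Hypothetical factory: builds a Lambda handler around a `scrape` function.
// `scrape` is any async function (url) => HTML string.
function createHandler(scrape) {
  return async (event) => {
    try {
      const { pageToScrape } = JSON.parse(event.body);
      const html = await scrape(pageToScrape);
      return {
        statusCode: 200,
        headers: { 'Content-Type': 'text/html; charset=utf-8' },
        body: html,
      };
    } catch (error) {
      // Report failures as a JSON error instead of letting the invocation crash
      return {
        statusCode: 500,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ error: error.message }),
      };
    }
  };
}

// Usage (with a `scrape` function of your own):
//   exports.handler = createHandler(scrape);
```

Passing `scrape` in as an argument also makes the handler easy to try locally with a stub, without starting Chromium.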
The downside is that you cannot update the Playwright or Chromium version on your own: you either wait for a new release of the playwright-aws-lambda library or contribute the update yourself.
2. Flexible and maintainable
We can connect Playwright and chrome-aws-lambda ourselves.
Install the dependencies:
npm install chrome-aws-lambda playwright-core --save
Then pass the Chromium executable path to Playwright in the following way:
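const { chromium } = require('playwright-core');
const awsChromium = require('chrome-aws-lambda');
//.....
const browser = await chromium.launch({
  headless: true, // Lambda has no display, so the browser must run headless
  args: awsChromium.args, // Chromium flags provided by chrome-aws-lambda
  executablePath: await awsChromium.executablePath, // a promise, so await it
});
//.....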
This way, we can change the Playwright and chrome-aws-lambda versions independently. It may take some trial and error, because not all versions are compatible with each other, so treat this as a starting point for your own experiments.
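When experimenting with versions, it can be convenient to run the same code both locally and in Lambda. The sketch below picks the launch options at runtime. `buildLaunchOptions` is a hypothetical helper; it assumes the object passed in exposes `args` and a promise-valued `executablePath` like the `chrome-aws-lambda` module, and it detects Lambda through the `AWS_LAMBDA_FUNCTION_NAME` environment variable that the Lambda runtime sets. The local branch further assumes that a browser Playwright can find is installed on the machine.

```javascript
// Hypothetical helper: builds the options object for `chromium.launch()`.
// `awsChromium` is expected to look like the `chrome-aws-lambda` module.
async function buildLaunchOptions(awsChromium, env = process.env) {
  const inLambda = Boolean(env.AWS_LAMBDA_FUNCTION_NAME);
  if (!inLambda) {
    // Locally: rely on a browser that Playwright can locate by itself
    return { headless: true };
  }
  return {
    headless: true,
    args: awsChromium.args,
    executablePath: await awsChromium.executablePath,
  };
}

// Usage (assumes `playwright-core` and `chrome-aws-lambda` are installed):
//   const { chromium } = require('playwright-core');
//   const awsChromium = require('chrome-aws-lambda');
//   const browser = await chromium.launch(await buildLaunchOptions(awsChromium));
```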
Conclusion
With Playwright, you get the latest browser API features from the former Puppeteer team, but community support is still limited, so you will have to resolve some issues on your own.
To learn more about Playwright, visit the official GitHub repo: https://github.com/microsoft/playwright
Alternatively, you can use our web scraping API to skip all of these difficulties and simply enjoy your data mining experience.