Web Scraping with Playwright in 6 Simple Steps

· 9 min read
Oleg Kulyk

Web scraping is the process of extracting data from external websites. It’s a valuable skill that helps you gather large amounts of data from the internet for various purposes. However, it can be daunting if you don’t know what tools to use.

Playwright is an effective web scraping tool that can facilitate data extraction with minimal effort. This Playwright tutorial will walk you through the six easy steps necessary to implement Playwright web scraping, complete with helpful hints, best practices, and examples.

A Word of Caution: Responsible Web Scraping

Web scraping is a powerful resource but must be used ethically and responsibly. Some recommendations for optimal behavior are as follows.

  1. Follow each website's robots.txt file and terms of service. Read both before scraping any content from the site. A website may forbid scraping outright or limit how often you can send requests.
  2. Don't overload a website by sending too many requests at once, as this can degrade the site's performance and keep other users from accessing it. Use throttling and rate limiting so that your scraping doesn't harm the website.
  3. Don't scrape sensitive information like login credentials, bank account details, or any other private data. Doing so is not only unethical but may also be illegal.
  4. Use reputable scraping tools. Tools like ScrapingAnt and Playwright are both efficient and respectful. Avoid anything that could slow down a website or collect data unethically.

If you follow these guidelines and use Playwright for web scraping, you can be confident that your data extraction process is completely responsible and ethical.
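
The throttling mentioned in the second recommendation can be sketched in a few lines. This is only an illustration, not a Playwright feature: createThrottle is a hypothetical helper, and the 1000 ms delay in the usage comment is an assumed value that you should adapt to the site you are scraping.

```javascript
// Hypothetical helper (not a Playwright API): returns a function that runs
// async tasks so that consecutive tasks start at least `minDelayMs` apart.
function createThrottle(minDelayMs) {
  let nextAllowedAt = 0; // earliest timestamp (ms) at which the next task may start

  return async function throttled(task) {
    const now = Date.now();
    const waitMs = Math.max(0, nextAllowedAt - now);
    // Reserve the next slot before waiting so that concurrent callers queue up.
    nextAllowedAt = now + waitMs + minDelayMs;
    if (waitMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    return task();
  };
}

// Usage sketch (assumes a Playwright `page` and an array `urls`):
// const throttled = createThrottle(1000);
// for (const url of urls) {
//   await throttled(() => page.goto(url));
// }
```

The helper reserves a time slot before it waits, so it also spaces out tasks that are started in parallel.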

Playwright Web Scraping Step-by-Step Guide

Here’s how to use Playwright to extract data:

Step 1: Install Playwright

The first step before scraping the web with Playwright is installing it. Playwright is a Node.js library, so you'll need to have Node.js installed on your computer before using it. If you already have Node.js set up, you can install the tool by typing the following command into your terminal or command prompt:

npm install playwright

Step 2: Launch a Browser

After installing Playwright, you'll need to launch a browser to start scraping. Playwright is compatible with Chromium, Firefox, and WebKit, so you can choose the browser you prefer. For example, to launch Chromium, you can use the following code:

const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to a website
  await page.goto('https://www.example.com');
  // Do something on the website
  // ...
  await browser.close();
})();

Best Practice: Use a Headless Browser

It’s recommended to use a headless browser when web scraping. Headless browsers are browsers without a graphical user interface. They run in the background and can be faster and more efficient than browsers with a user interface.

Playwright launches browsers in headless mode by default. To make this explicit, you can pass the headless: true option to the launch() method:

const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch({ headless: true });
  // ...
})();

Step 3: Navigate to a Website

Once you've launched a browser, the next step is to navigate to the website you want to scrape. You can use the page.goto() method to navigate to a website. For example:

await page.goto('https://www.example.com');
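
Navigation can fail because of timeouts or temporary network errors, so it can help to retry a few times. The sketch below assumes that page is a Playwright Page; gotoWithRetry is a hypothetical helper, and the retry count and timeout are illustrative values, not recommendations from Playwright.

```javascript
// Hypothetical helper: retries page.goto() a few times before giving up.
// `retries` and `timeoutMs` are illustrative defaults.
async function gotoWithRetry(page, url, { retries = 2, timeoutMs = 30000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      // `waitUntil` and `timeout` are options of page.goto().
      return await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    } catch (error) {
      lastError = error; // for example, a timeout or a network error
    }
  }
  throw lastError; // every attempt failed
}

// Usage sketch:
// await gotoWithRetry(page, 'https://www.example.com', { retries: 3 });
```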

Best Practice: Set a User Agent

Some websites may block your requests if they detect that you are a web scraper. You can reduce the chance of this by setting your browser's user agent (a string that identifies the browser and operating system you are using). To set a user agent, pass the userAgent option to the browser.newContext() method:

const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' +
      ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  });
  const page = await context.newPage();
  // ...
})();

Step 4: Extract Data

Once you've arrived at the target website, you can use Playwright to extract the required data. The tool provides several methods for extracting data, including page.$(), page.$$(), page.$eval(), page.$$eval(), and page.evaluate().

For example, you can extract the website title using the page.title() method:

const pageTitle = await page.title(); 
console.log(pageTitle);

The page.$eval() method runs a function against the first element that matches a selector and returns the result. For example, to extract the text from the first <h1> element on the website, you can use the following code:

const elementText = await page.$eval('h1', (el) => el.textContent); 
console.log(elementText);

You can also use the page.$$eval() function to extract data from multiple elements on the website. For example, to extract the text from all the <h1> elements on the website, you can use the following code:

const elementTexts = await page.$$eval('h1', (els) => els.map((el) => el.textContent)); 
console.log(elementTexts);

Best Practice: Use Selectors

Using selectors to zero in on the specific elements you need to extract data from on a website is crucial. Selectors function like street addresses in that they identify the targeted HTML elements. Playwright supports several selector languages, including CSS selectors, XPath expressions, and others.

For example, to select all the <a> elements on a website, you can use the following code:

const links = await page.$$('a');
console.log(links.length);
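
To show how the selector languages compare, the sketch below reads the same headings once with a CSS selector and once with an XPath expression. It assumes that page is a Playwright Page; readHeadings is a hypothetical helper, and the h2 selector is only an example.

```javascript
// Sketch: the same elements selected with CSS and with XPath.
async function readHeadings(page) {
  // CSS selector (the default selector engine).
  const viaCss = await page.locator('h2').allTextContents();
  // XPath expression, marked explicitly with the `xpath=` prefix.
  const viaXpath = await page.locator('xpath=//h2').allTextContents();
  return { viaCss, viaXpath };
}

// Usage sketch:
// const { viaCss, viaXpath } = await readHeadings(page);
// console.log(viaCss, viaXpath);
```

On a real page both arrays should contain the same texts, so in practice you would pick whichever selector language reads better for the page at hand.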

Step 5: Handle Navigation and User Input

Playwright's ability to handle navigation and user input is one of its most powerful features. You can click buttons, fill out forms, and navigate between the pages of a website.

For example, to navigate to a different page on a website, you can use the page.goto() method:

await page.goto('https://www.example.com/page2');

To fill out a form on a website, you can use the page.type() method:

await page.type('#username', 'myusername'); 
await page.type('#password', 'mypassword');

To click a button on a website, you can use the page.click() method:

await page.click('#mybutton');

Best Practice: Wait for Elements to Load

You should wait for website elements to load before data extraction with Playwright. You have several options for waiting for elements to load, including the page.waitForSelector() method and the page.waitForFunction() method.

For example, you can use the following code to delay further action until a certain website component has finished loading:

await page.waitForSelector('#myelement');

This will wait for the element with the id of “myelement” to appear on the website before continuing.
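
Putting the pieces of this step together, a form submission that waits for its result could look like the sketch below. It assumes that page is a Playwright Page. submitLogin is a hypothetical helper, and the ids (#username, #password, #mybutton, #myelement) are the placeholders from the snippets above, so a real website will need different selectors. The sketch uses page.fill(), which sets the whole field value at once, as an alternative to page.type().

```javascript
// Sketch: fill a form, submit it, and wait for the result to appear.
async function submitLogin(page, username, password, timeoutMs = 10000) {
  await page.fill('#username', username);
  await page.fill('#password', password);
  await page.click('#mybutton');
  // Resolves once the element is visible, or throws after `timeoutMs`.
  await page.waitForSelector('#myelement', { state: 'visible', timeout: timeoutMs });
  return page.textContent('#myelement');
}

// Usage sketch:
// const message = await submitLogin(page, 'myusername', 'mypassword');
// console.log(message);
```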

Step 6: Clean up and Exit

After you've finished scraping a website for information, it's best practice to clean up and exit the Playwright instance. You can use the browser.close() method to close the browser and clean up any resources used by Playwright:

await browser.close();
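
If the scraping code throws an error before it reaches browser.close(), the browser process can be left running. One way to avoid this is a try/finally block, sketched below. withPage is a hypothetical helper, and browserType is assumed to be a Playwright browser type such as chromium, firefox, or webkit.

```javascript
// Hypothetical helper: launches a browser, runs `work(page)`, and always
// closes the browser, even if `work` throws.
async function withPage(browserType, work) {
  const browser = await browserType.launch();
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    return await work(page);
  } finally {
    await browser.close(); // runs on success and on failure
  }
}

// Usage sketch (requires `npm install playwright`):
// const { chromium } = require('playwright');
// const title = await withPage(chromium, async (page) => {
//   await page.goto('https://www.example.com');
//   return page.title();
// });
```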

Web scraping with Playwright is that easy!

Playwright’s Data Extraction Capabilities

There are numerous ways to extract data with Playwright. You can use it to extract text and attributes, and to capture pages as screenshots or PDF files.

Here are some of the most significant Playwright data extraction capabilities:

1. Extract text from elements on a website using the page.$eval() method

This method allows you to extract the text of a single element that matches a selector. Here's an example of how to extract a heading’s text on a website:

const headingText = await page.$eval('h1', element => element.textContent); 
console.log(headingText);

This will extract the text of the first h1 element on the page and log it to the console.

2. Extract data from multiple elements on a website using the page.$$eval() method

This technique lets you collect data from multiple elements that share a selector. Here's an example of how to extract the href attributes of all links on a website:

const linkUrls = await page.$$eval('a', elements => elements.map(element => element.href));
console.log(linkUrls);

This will extract the href attributes of all <a> elements on the page and log them to the console.

3. Extract text from a website using the page.evaluate() method

This method allows you to extract data from a website by running JavaScript in the context of the page. Here's an example of how to extract the text of all h1 elements on a website:

const headingTexts = await page.evaluate(() => {
  const elements = document.querySelectorAll('h1');
  return Array.from(elements).map(element => element.textContent);
});
console.log(headingTexts);

This will extract the text of all h1 elements on the page and log them to the console.
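
The function passed to page.evaluate() runs inside the browser, so it cannot see variables from your Node.js script. Values have to be passed in as an argument instead. The sketch below shows this for a selector. It assumes that page is a Playwright Page, and extractTexts is a hypothetical helper.

```javascript
// Sketch: pass the selector into the page function as an argument,
// because the function is executed in the browser, not in Node.js.
async function extractTexts(page, selector) {
  return page.evaluate((sel) => {
    const elements = document.querySelectorAll(sel);
    return Array.from(elements).map((element) => element.textContent.trim());
  }, selector);
}

// Usage sketch:
// const headings = await extractTexts(page, 'h1');
// console.log(headings);
```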

4. Extract text from a website using the page.textContent() method

This method returns the text content of the first element that matches a selector, including text that is hidden by CSS. Here's an example of how to extract the text of the first h1 element on a website:

const headingText = await page.textContent('h1');
console.log(headingText);

This will extract the text of the first h1 element on the page and log it to the console.

5. Extract text from a website using the page.innerText() method

This method returns the rendered text of the first element that matches a selector, so text that is hidden by CSS is left out. Here's an example of how to extract the text of the first h1 element on a website:

const headingText = await page.innerText('h1');
console.log(headingText);

This will extract the text of the first h1 element on the page and log it to the console.
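
Text extracted with either method often contains line breaks and repeated spaces from the page markup, and page.textContent() may return null. A small cleanup step can help before you store the value. normalizeText below is a hypothetical helper, not part of Playwright.

```javascript
// Hypothetical helper: collapses runs of whitespace into single spaces and
// trims the result. Returns an empty string for null or undefined input.
function normalizeText(raw) {
  if (raw == null) return '';
  return raw.replace(/\s+/g, ' ').trim();
}

// Usage sketch (assumes a Playwright `page`):
// const heading = normalizeText(await page.textContent('h1'));
```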

6. Capture a screenshot of a website using the page.screenshot() method

This Playwright data extraction method lets you take a screenshot of a website. Here's an example of how to take a screenshot of a website and save it to a file:

await page.screenshot({ path: 'screenshot.png' });

This takes a screenshot of the current page and saves it to a file named screenshot.png.
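
A screenshot captures how the page looks, not the image files it contains. If you need the images themselves, one option is to collect the URLs of the <img> elements first, as in the sketch below. It assumes that page is a Playwright Page, and collectImageUrls is a hypothetical helper.

```javascript
// Sketch: collect the URLs of all <img> elements on the current page.
async function collectImageUrls(page) {
  const urls = await page.$$eval('img', (images) =>
    images.map((image) => image.src).filter((src) => src.length > 0)
  );
  return [...new Set(urls)]; // drop duplicates, keep first-seen order
}

// Usage sketch:
// const imageUrls = await collectImageUrls(page);
// console.log(imageUrls);
```

To capture the whole scrollable page instead of only the visible area, page.screenshot() also accepts a fullPage: true option.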

7. Save a website as a PDF using the page.pdf() method

This Playwright data extraction method lets you save a website as a PDF file. Note that page.pdf() is only supported in headless Chromium. Here's an example of how to save a website as a PDF file:

await page.pdf({ path: 'page.pdf' });

This saves the current page as a PDF file named page.pdf.

Conclusion

Playwright is an effective browser automation and web scraping tool. It makes it simple to extract data from websites, interact with websites, and automate complex workflows.

Following the straightforward instructions in this Playwright tutorial will help you start using Playwright web scraping to automate browser-based tasks and scrape the web effectively. Remember to use selectors to target elements, wait for elements to load, and clean up after yourself when you're done.

Happy Web Scraping and don't forget to dig deeper into Playwright's documentation to learn more about its capabilities 🤓
