This article will show how to block specific resources (HTTP requests, CSS, video, images) from loading in Playwright. Playwright is Puppeteer's successor, able to control Chromium, Firefox, and WebKit, and it is one of the most widely used web scraping and automation tools with headless browser support.
Blocking resources from loading while web scraping is a widespread technique that saves time and money.
For example, when you crawl a resource for product information (price, product name, image URL, etc.), you don't need to load external fonts, CSS, videos, or the images themselves. In most cases, you only need the text information and the direct URLs of the media content.
Also, such improvements will:
- speed up your web scraper
- increase the number of pages scraped per minute (you'll pay less for your servers and get more information for the same infrastructure price)
- decrease proxy bills (you won't spend proxy traffic on downloading irrelevant content)
Since Playwright is Puppeteer's successor with a similar API, it feels natural to try the same request interception mechanism. The documentation for both libraries shows how to observe the page's requests.
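As a minimal sketch (assuming Playwright for Node.js is installed via `npm i playwright`, with `https://example.com` standing in for your target site), subscribing to the `request` event looks like this:

```javascript
// Log every request the page makes, together with its resource type.
// Requires Playwright for Node.js: npm i playwright
const logRequests = async (url) => {
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // The 'request' event fires for each request the page issues;
  // the Request object exposes the URL and the resource type.
  page.on('request', (request) =>
    console.log(`${request.resourceType()}: ${request.url()}`)
  );

  await page.goto(url);
  await browser.close();
};

// run it: logRequests('https://example.com');
```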
So, the output will show information about each requested resource and its type.
Still, according to Playwright's documentation, the `Request` object passed to this callback is immutable, so you won't be able to manipulate the request through it.
Let's check out Playwright's suggestion for this situation: `page.route`. The concept behind `page.route` interception is very similar to Puppeteer's `page.on('request')`, but it requires indirect access to the `Request` object through the `Route` object passed to the handler. So, we intercept routes and then indirectly access the requests behind those routes.
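Here's a sketch of blocking images this way (the target URL and the `scrapeWithoutImages` name are placeholders of mine): we register a catch-all route, abort any request whose resource type is `image`, and continue everything else:

```javascript
// Abort image requests, let everything else through.
// Requires Playwright for Node.js: npm i playwright
const scrapeWithoutImages = async (url) => {
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // page.route gives us a Route; route.request() is the indirect
  // access to the immutable Request object.
  await page.route('**/*', (route) =>
    route.request().resourceType() === 'image'
      ? route.abort()
      : route.continue()
  );

  await page.goto(url);
  const html = await page.content();
  await browser.close();
  return html;
};

// run it: scrapeWithoutImages('https://example.com').then(console.log);
```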
As a result, you will see the website images not being loaded.
All the supported resource types can be found below:
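Per the Playwright documentation, `request.resourceType()` returns one of the following values, collected here as a constant you can filter against:

```javascript
// All resource types Playwright's request.resourceType() can return.
const RESOURCE_TYPES = [
  'document', 'stylesheet', 'image', 'media', 'font', 'script',
  'texttrack', 'xhr', 'fetch', 'eventsource', 'websocket',
  'manifest', 'other',
];
```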
Also, you can apply any other condition for request prevention, like the resource URL:
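For example (a sketch; the tracker domain is made up), you can test `route.request().url()` with a predicate of your own, or hand `page.route` a glob pattern that matches only the URLs you want to drop:

```javascript
// Hypothetical predicate: drop requests to a third-party tracker.
const isBlockedUrl = (url) => url.includes('tracker.example.com');

// Inside your scraper, after creating `page`:
// await page.route('**/*', (route) =>
//   isBlockedUrl(route.request().url()) ? route.abort() : route.continue()
// );
//
// Or let the route pattern itself do the matching:
// await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
```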
Since the start of my web scraping journey, I've found the following exclusion list pretty neat: it improves Single-Page Application scrapers and decreases scraping time by up to 10x:
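A sketch of such a list (the exact set of types is my choice; tune it to your target sites): abort the heavyweight static types, but keep scripts and XHR/fetch enabled so the SPA can still render its data:

```javascript
// Resource types to abort: binary and media content that a text
// scraper rarely needs. Scripts and XHR/fetch stay enabled so
// Single-Page Applications still render.
const BLOCKED_TYPES = ['image', 'media', 'font', 'stylesheet'];

const shouldAbort = (request) =>
  BLOCKED_TYPES.includes(request.resourceType());

// Wiring it into Playwright:
// await page.route('**/*', (route) =>
//   shouldAbort(route.request()) ? route.abort() : route.continue()
// );
```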
Such a snippet prevents binary and media content from loading while still letting the dynamic web page load everything it needs.
Request interception is a basic web scraping technique that improves crawler performance and saves money when extracting data at scale.
To save even more money, you can check out the web scraping API concept: it handles headless browsers and proxies for you, so you can forget about giant server and proxy bills.
Also, these articles might interest you:
Happy Web Scraping, and don't forget to enable caching in your headless browser 💾