This article explains how to block specific resources (HTTP requests, CSS, video, images) from loading in Playwright. Playwright is Puppeteer's successor and can control Chromium, Firefox, and WebKit. I'd call it the second most widely used web scraping and automation tool with headless browser support.
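As a rough illustration of the idea, a Playwright route handler can abort requests based on their resource type. The sketch below uses the Python sync API (`page.route`, `route.abort`, `route.continue_`); the helper names `should_block`, `handle_route`, and `install_blocking`, as well as the set of blocked types, are assumptions made for this example.

```python
# Resource types to skip. This particular set is an illustrative assumption;
# adjust it to whatever the scraper does not need.
BLOCKED_RESOURCE_TYPES = {"image", "media", "stylesheet", "font"}


def should_block(resource_type: str) -> bool:
    """Return True when a request of this resource type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route) -> None:
    """Route handler: abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def install_blocking(page) -> None:
    """Attach the handler to every request made by a Playwright page."""
    page.route("**/*", handle_route)
```

Calling `install_blocking(page)` before `page.goto(...)` should apply the filter to every request the page makes.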
In this article, we'll take a look at how to submit forms using Playwright. This is useful when scraping the web, because some target pages only return the information you need after you provide parameters.
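A minimal sketch of form submission, assuming Playwright's Python sync API: each field is filled with `page.fill` and the form is sent with `page.click`. The helper name `submit_form` and the selectors in the usage note are hypothetical.

```python
from typing import Mapping


def submit_form(page, fields: Mapping[str, str], submit_selector: str) -> None:
    """Fill each input (selector -> value) and click the submit control."""
    for selector, value in fields.items():
        page.fill(selector, value)
    page.click(submit_selector)
```

For example, `submit_form(page, {"#query": "playwright"}, "button[type=submit]")` would fill a single search field and submit it, provided those selectors exist on the target page.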
In this article, we will share several ideas on how to download files with Playwright. Automating file downloads can be confusing: you need to handle the download location, download multiple files simultaneously, support streaming, and more. Unfortunately, not all of these cases are well documented. Let's go through several examples and take a deep dive into the Playwright APIs used for file downloads.
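One common pattern, sketched here with the Python sync API, is to wrap the click that triggers a download in `page.expect_download()` and then store the file with `download.save_as`. The helper name `download_via_click` and its parameters are assumptions for illustration; streaming and parallel downloads are not covered by this sketch.

```python
from pathlib import Path


def download_via_click(page, selector: str, target_dir: str) -> Path:
    """Click an element that starts a download and save the file to target_dir."""
    # expect_download() waits for the download started inside the block.
    with page.expect_download() as download_info:
        page.click(selector)
    download = download_info.value
    # Keep the file name suggested by the server or the link.
    target = Path(target_dir) / download.suggested_filename
    download.save_as(target)
    return target
```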
Dynamic languages are helpful tools for web scraping. Scripting allows users to rapidly tie together complex systems or libraries and express ideas without dealing with memory management or build systems.
As you know, Puppeteer is a high-level API for controlling headless Chrome, and it's probably one of the most popular web scraping tools on the Internet. The only problem is that an average web developer might be overwhelmed by the number of settings involved in a proper web scraping setup.
I want to share 6 handy and fairly obvious tricks that should help web developers increase their scrapers' success rate, improve performance, and avoid bans.
Scraping a website with the browsers that real users actually run has a real benefit: the scraper is less likely to be banned because of its fingerprint or behavioral pattern. Playwright already provides full support for Chromium, Firefox, and WebKit out of the box, without installing the browsers manually. However, since most users run Google Chrome or Microsoft Edge rather than the open-source Chromium build, in some scenarios it's safer to use those browsers to emulate a more realistic environment.
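In Playwright, a branded browser is selected through the `channel` option of `chromium.launch`. The following sketch assumes the Python sync API and that Chrome or Edge is already installed on the machine; the helper name `launch_branded_browser` and the restriction to two channels are choices made for this example.

```python
# Stable channels only; restricting to these two is an assumption of this sketch.
SUPPORTED_CHANNELS = {"chrome", "msedge"}


def launch_branded_browser(playwright, channel: str = "chrome", headless: bool = True):
    """Launch Google Chrome or Microsoft Edge instead of the bundled Chromium."""
    if channel not in SUPPORTED_CHANNELS:
        raise ValueError(f"unsupported channel: {channel!r}")
    # 'playwright' is the object yielded by sync_playwright().
    return playwright.chromium.launch(channel=channel, headless=headless)
```

A typical call would look like `with sync_playwright() as p: browser = launch_branded_browser(p, "msedge")`, with `sync_playwright` imported from `playwright.sync_api`.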
There has been a lot of news about TikTok being sold to US companies, and scraping TikTok data has become a more pressing question because the service may be shut down.
HTML is a simple structured markup language, and anyone who writes a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing. We are not going to present libraries for more specific tasks, such as article extractors, product extractors, or web scrapers.
In this article, we’d like to introduce an awesome open-source web scraping solution for running a pool of Chromium instances using Puppeteer.
In this article, I’d like to share a quick guide on how to run Playwright inside AWS Lambda. There are plenty of similar guides for Puppeteer, but only a few cover its successor from Microsoft.
JavaScript is a well-known language with wide adoption and strong community support. It can be used for both client-side and server-side web scraping scripts, which makes it well suited to writing your scrapers and crawlers.
Most of these libraries' advantages are also available through a web scraping API, and some of the libraries can be used alongside one.
So let’s check them out.
AngularJS is a fairly common framework for building modern single-page applications, but how well can sites built on it be scraped? Let’s find out.
In this article, I’d like to share my experience with scraping Amazon products. The well-known Amazon marketplace offers the best deals across thousands of product types from thousands of sellers. The amount of data available to scrape is enormous, and it can be used for:
- Market price comparison
- Price change tracking
- Analyzing product reviews
- Copyright checks
- Finding the best products for selling or dropshipping
- Many data science and machine learning tasks