Puppeteer is a high-level API to control headless Chrome. Most things that you can do manually in the browser can be done using Puppeteer, so it quickly became one of the most popular web scraping tools in Node.js and Python. Many developers use it to extract data from single-page applications (SPAs), as it allows executing client-side JavaScript. In this article, we are going to show how to set up a proxy in Puppeteer and how to spin up your own rotating proxy server.
186 posts tagged with "web scraping"
How to use Microsoft Edge with Playwright
Scraping a website with a browser that real users actually run has a real benefit: it helps ensure that the scraper will not be banned because of its fingerprint or behavioral pattern. Playwright already provides full support for Chromium, Firefox, and WebKit out of the box, without installing the browsers manually. But since most users run Google Chrome or Microsoft Edge instead of the open-source Chromium variant, in some scenarios it's safer to use those browsers to emulate a more realistic environment.
GPT-2 answers what is Web Scraping
Please don't take this article too seriously.
While playing around with machine learning, we found a pretty interesting white paper about GPT-2. Let's find out what it can generate about web scraping!
HTML Parsing Libraries - C#
Websites are written using HTML, which means that each web page is a structured document. Sometimes the goal is to obtain some data from them and preserve the structure while we’re at it. Websites don’t always provide their data in comfortable formats such as CSV or JSON, so the only way to deal with it is to parse the HTML page.
HTML Parsing Libraries - Java
HTML is a simply structured markup language, and anyone who is going to write a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing.
How to Collect Data from TikTok
There is a lot of news about TikTok being sold to US companies, and the question of scraping TikTok data is becoming more pressing given the possible shutdown of the service.
Web browser automation with Python and Playwright
In this article, we'd like to share the current state of Playwright integration with Python, along with several helpful code snippets that illustrate the techniques.
HTML Parsing Libraries - JavaScript
HTML is a simply structured markup language, and anyone who is going to write a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing. We are not going to present libraries for more specific tasks, such as article extractors, product extractors, or web scrapers.
Open Source Javascript Web Scraping
In this article, I’d like to list some of the most popular open-source JavaScript projects that can be useful for web scraping. The list includes both libraries and standalone niche scrapers that can scrape a particular site (Amazon, iTunes, Instagram, Google Play, etc.).
Scraping with millions of browsers or Puppeteer Cluster
In this article, we’d like to introduce an awesome open-source web scraping solution for running a pool of Chromium instances using Puppeteer.