In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPi provide the ability to create a powerful JSON API for the date. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.
As you know, Puppeteer is a high-level API to control headless Chrome, and it's probably one of the most popular web scraping tools on the Internet. The only problem is that an average web developer might be overloaded by tons of possible settings for a proper web scraping setup.
I want to share 6 handy and pretty obvious tricks that should help web developers to increase web scraper success rate, improve performance and avoid bans.
Web scraping a website with the actually supported or other browsers has a real benefit in ensuring that the scraper will not be banned by the fingerprint or the behavioral pattern. Playwright already provides full support for Chromium, Firefox, and WebKit out of the box without installing the browsers manually, but since most of the users out there use Google Chrome or Microsoft Edge instead of the open-source Chromium variant, in some scenarios, it's safer to use them to emulate a more realistic browser environment.
Please, don't consider this article too serious.
While playing around machine learning, we've found pretty interesting white paper about GPT-2. Let's find out what it can generate about web scraping!
Web sites are written using HTML, which means that each web page is a structured document. Sometimes the goal is to obtain some data from them and preserve the structure while we’re at it. Websites don’t always provide their data in comfortable formats such as CSV or JSON, so only the way to deal with it is to parse the HTML page.
HTML is a simply structured markup language and everyone who is going to write a web scraper should deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing.