196 posts tagged with "web scraping"

· 7 min read
Satyam Tripathi

Open Source Web Scraping Libraries to Bypass Anti-Bot Systems

Approximately one in five websites targeted for scraping employ advanced anti-bot systems that can block access outright. These systems, such as Cloudflare, DataDome, and PerimeterX, are designed to detect and block automated traffic, which makes it increasingly difficult for traditional scraping tools to function effectively.

To address these challenges, a variety of open-source libraries have emerged, each offering unique features and techniques to bypass these anti-bot mechanisms.

· 8 min read
Satyam Tripathi

JavaScript vs Python for Web Scraping: Which Is Best?

In the rapidly evolving landscape of web technologies, web scraping has emerged as a crucial tool for data extraction and analysis. As of 2024, two programming languages, JavaScript and Python, stand out as popular choices for developers engaging in web scraping tasks. Each language offers unique strengths and capabilities, making the decision between them a significant consideration for developers at all levels.

· 9 min read
Satyam Tripathi

Playwright vs. Puppeteer in 2024: Which Should You Choose?

In the ever-evolving landscape of web automation and testing, two tools have consistently stood out: Playwright and Puppeteer. As of 2024, both have matured significantly. Developed by teams at Microsoft and Google respectively, they offer robust solutions for automating browser tasks, but they cater to slightly different needs and preferences.

· 7 min read
Satyam Tripathi

Playwright vs. Selenium - A Comprehensive Comparison for 2024

In the rapidly evolving landscape of web automation and testing, two open-source frameworks have emerged as leading tools: Playwright and Selenium. Both frameworks offer unique features and capabilities, making the choice between them a nuanced decision that depends on specific project requirements and team expertise.

· 10 min read
Satyam Tripathi

Top Python HTTP Clients for Web Scraping

In the ever-evolving landscape of web scraping, Python remains the language of choice for developers due to its simplicity, readability, and robust ecosystem of libraries. Python offers a diverse array of HTTP clients that cater to various web scraping needs, from simple data extraction to complex, high-concurrency tasks.

This guide delves into the top Python HTTP clients, exploring their features, pros, and cons, and providing code examples to get you started.

· 11 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 4 - Avoid Getting Blocked

In Part 3, we focused on analyzing and cleaning the extracted data to address potential issues like missing values, inconsistencies, and outliers. To support future decision-making, we saved the cleaned data to several destinations, such as CSV files, databases, and S3 buckets.

In Part 4, we'll delve into strategies for bypassing common web scraping hurdles. We'll explore techniques such as using proxies, rotating user agents, and leveraging web scraping APIs to keep your scraping tasks running smoothly.

Without further ado, let’s get started!
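
As a rough illustration of two of the techniques named above, rotating user agents and routing traffic through a proxy, here is a minimal Python sketch that uses only the standard library. The user-agent strings, the `build_request` and `build_proxy_opener` helpers, and the proxy address are placeholders assumed for illustration. They are not taken from the series, which may use a different language or approach.

```python
import itertools
import urllib.request

# Placeholder user-agent strings (assumed for illustration; use real, current ones).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]

# Endless round-robin iterator over the user agents.
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Return a Request whose User-Agent is the next one in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    opener = build_proxy_opener("http://127.0.0.1:8080")  # hypothetical proxy
    for _ in range(3):
        req = build_request("https://example.com/")
        print(req.get_header("User-agent"))
        # opener.open(req, timeout=10) would send the request through the proxy.
```

Nothing is sent over the network until `opener.open(...)` is called, so the rotation logic can be checked on its own before pointing it at a real proxy.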

· 11 min read
Satyam Tripathi

How to use Selenium Wire in 2024

Web scraping has become an essential technique for extracting data from websites, especially in an era where data-driven decision-making is paramount. Among the myriad of tools available for web scraping, Selenium stands out due to its ability to interact with web pages like a real user.

However, when it comes to accessing and manipulating network traffic, Selenium's capabilities are limited. This is where Selenium Wire comes into play, offering a powerful extension to the standard Selenium library.

This blog delves into various aspects of Selenium Wire, covering its installation, configuration, and features. It includes details on capturing and modifying HTTP requests, configuring proxies, and blocking requests to improve performance. It also covers performance optimization and troubleshooting common issues.

· 22 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 3 - Storing Data

In Part 2, we talked about creating a web scraper with Playwright to extract data from the Nike website, which has dynamically loaded content.

In Part 3, we will focus on analyzing the extracted data and cleaning it to deal with potential issues like missing values, inconsistencies, and outliers. The cleaned data will then be stored in several destinations, such as CSV files, databases, and S3 buckets, to support future decision-making.

Without further ado, let’s get started!

· 16 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 2 - Building a Scraper

In Part 1, you learned about the basics of Playwright, environment setup, browser launching, and taking screenshots.

In Part 2, you’ll learn how to build a scraper from scratch. We'll cover how to locate and extract data, manage dynamically loaded content, utilize Playwright's network event feature, and improve the scraper's performance by blocking unnecessary resources.

Without further ado, let’s get started!
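
To give a sense of what blocking unnecessary resources can look like, here is a minimal sketch that uses Playwright's Python sync API. The choice of Python, the set of blocked resource types, and the `should_block`, `handle_route`, and `fetch_title` helpers are assumptions made for illustration. The series itself may use a different language and a different blocking policy.

```python
# Resource types that rarely matter for data extraction (an assumed list).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_block(resource_type):
    """Return True if a request of this resource type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route):
    """Abort requests for blocked resource types; let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def fetch_title(url):
    """Load url with resource blocking enabled and return the page title."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Intercept every request made by the page and apply the policy above.
        page.route("**/*", handle_route)
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

Calling `fetch_title("https://example.com/")` requires Playwright and its browser binaries to be installed. Skipping images, fonts, and stylesheets usually reduces bandwidth and load time, though some sites need their stylesheets or scripts for the content to render correctly.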

· 7 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 1 - Getting Started

Introducing the 4-Part Series on Web Scraping with Playwright! This comprehensive series will delve into web scraping using Playwright, a powerful and versatile tool for automating browser interactions.

By the end of this series, you'll have a solid understanding of web scraping with Playwright. You'll be able to build robust scrapers that can handle dynamic content, efficiently store data, and navigate through anti-scraping mechanisms.

In Part 1, you'll learn about the basics of Playwright, why it's useful, how to set up the environment, how to launch the browser using Playwright, and how to take screenshots.