265 posts tagged with "web scraping"

JavaScript vs Python for Web Scraping - Which Is Best?

August 31, 2024 · 8 min read

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

JavaScript vs Python for Web Scraping: Which Is Best?

In the rapidly evolving landscape of web technologies, web scraping has emerged as a crucial tool for data extraction and analysis. As of 2024, two programming languages, JavaScript and Python, stand out as popular choices for developers engaging in web scraping tasks. Each language offers unique strengths and capabilities, making the decision between them a significant consideration for developers at all levels.

Playwright vs. Puppeteer in 2024 - Which Should You Choose?

August 30, 2024 · 9 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Playwright vs. Puppeteer in 2024: Which Should You Choose?

In the ever-evolving landscape of web automation and testing, two tools have consistently stood out: Playwright and Puppeteer. As of 2024, both have matured significantly, offering robust features for developers and testers alike. Both tools, developed by teams at Microsoft and Google respectively, offer robust solutions for automating browser tasks, but they cater to slightly different needs and preferences.

Playwright vs. Selenium - A Comprehensive Comparison for 2024

August 29, 2024 · 7 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Playwright vs. Selenium - A Comprehensive Comparison for 2024

In the rapidly evolving landscape of web automation and testing, two open-source frameworks have emerged as leading tools: Playwright and Selenium. Both frameworks offer unique features and capabilities, making the choice between them a nuanced decision that depends on specific project requirements and team expertise.

Top Python HTTP Clients for Web Scraping

August 29, 2024 · 10 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Top Python HTTP Clients for Web Scraping

In the ever-evolving landscape of web scraping, Python remains the language of choice for developers due to its simplicity, readability, and a robust ecosystem of libraries. Python offers a diverse array of HTTP clients that cater to various web scraping needs, from simple data extraction to complex, high-concurrency tasks.

This guide delves into the top Python HTTP clients, exploring their features, pros, cons, and providing code examples to get started.

Web Scraping with Playwright Series Part 4 - Avoid Getting Blocked

August 29, 2024 · 11 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Web Scraping with Playwright Series Part 4 - Avoid Getting Blocked

In Part 3, we focused on analyzing and cleaning the extracted data to address potential issues like missing values, inconsistencies, and outliers. To make it easier for future decision-making, we saved the cleaned data in various formats, such as CSV, databases, and S3 buckets.

In Part 4, we'll delve into strategies for bypassing common web scraping hurdles. We'll explore techniques such as using proxies, rotating user agents, and leveraging web scraping APIs to keep your scraping tasks running smoothly.

Without further ado, let’s get started!

How to use Selenium Wire in 2024

August 27, 2024 · 11 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

How to use Selenium Wire in 2024

Web scraping has become an essential technique for extracting data from websites, especially in an era where data-driven decision-making is paramount. Among the myriad of tools available for web scraping, Selenium stands out due to its ability to interact with web pages like a real user.

However, when it comes to accessing and manipulating network traffic, Selenium's capabilities are limited. This is where Selenium Wire comes into play, offering a powerful extension to the standard Selenium library.

This blog delves into various aspects of Selenium Wire, covering its installation, configuration, and features. It includes details on capturing and modifying HTTP requests, proxy configuration, and advanced request blocking techniques to enhance performance. Additionally, it delves into advanced techniques for request blocking, optimization of performance, and troubleshooting common issues.

Web Scraping with Playwright Series Part 3 - Storing Data

August 26, 2024 · 22 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Web Scraping with Playwright Series Part 3 - Storing Data

In Part 2, we talked about creating a web scraper with Playwright to extract data from the Nike website, which has dynamically loaded content.

In Part 3, we will focus on carefully analyzing the extracted data and ensuring it's properly cleaned to deal with potential issues like missing values, inconsistencies, and outliers. The cleaned data will then be stored in different formats such as CSV, databases, and S3 buckets to make it easier for future decision-making.

Without further ado, let’s get started!

Web Scraping with Playwright Series Part 2 - Building a Scraper

August 25, 2024 · 16 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Web Scraping with Playwright Series Part 2 - Building a Scraper

In Part 1, you learned about the basics of Playwright, environment setup, browser launching, and taking screenshots.

In Part 2, you’ll learn how to build a scraper from scratch. We'll cover how to locate and extract data, manage dynamically loaded content, utilize Playwright's network event feature, and improve the scraper's performance by blocking unnecessary resources.

Without further ado, let’s get started!

Web Scraping with Playwright Series Part 1 - Getting Started

August 24, 2024 · 7 min read

Satyam Tripathi

Satyam is a junior data engineer and seasoned blogger. He has created several top-ranked tutorials on different topics like web scraping, automation, and scraping tools. He is always open to working with new technologies in the market and sharing his knowledge.

Web Scraping with Playwright Series Part 1 - Getting Started

Introducing the 4-Part Series on Web Scraping with Playwright! This comprehensive series will delve into web scraping using Playwright, a powerful and versatile tool for automating browser interactions.

By the end of this series, you'll have a solid understanding of web scraping with Playwright. You'll be able to build robust scrapers that can handle dynamic content, efficiently store data, and navigate through anti-scraping mechanisms.

In Part 1, you'll learn about the basics of Playwright, why it's useful, how to set up the environment, how to launch the browser using Playwright, and how to take screenshots.

Requests vs. HTTPX - A Detailed Comparison

August 23, 2024 · 17 min read

Oleg Kulyk

Co-Founder @ ScrapingAnt

Requests vs. HTTPX - A Detailed Comparison

In the realm of Python development, making HTTP requests is a frequent task that requires efficient and reliable libraries. Two prominent libraries, Requests and HTTPX, have been widely adopted by developers for this purpose. Each library has its strengths and weaknesses, making the choice between them dependent on the specific requirements of the project. This research aims to provide a comprehensive comparison between Requests and HTTPX, considering various aspects such as asynchronous support, HTTP/2 compatibility, connection management, error handling, and performance metrics.

Requests, a well-established library, is celebrated for its simplicity and ease of use. It is often the go-to choice for developers who need to make straightforward, synchronous HTTP requests. However, its lack of native support for asynchronous operations and HTTP/2 can be a limitation for high-concurrency applications. On the other hand, HTTPX, a newer library, offers advanced features such as asynchronous support and HTTP/2, making it a more powerful tool for performance-critical applications.

This research will delve into the key feature comparisons and performance metrics of both libraries, providing detailed code examples and explanations. By examining these factors, developers can make an informed decision on which library best suits their needs. This comparison is supported by various benchmarks and source.