Skip to main content

231 posts tagged with "data extraction"

View All Tags

· 10 min read
Satyam Tripathi

Top Python HTTP Clients for Web Scraping

In the ever-evolving landscape of web scraping, Python remains the language of choice for developers due to its simplicity, readability, and a robust ecosystem of libraries. Python offers a diverse array of HTTP clients that cater to various web scraping needs, from simple data extraction to complex, high-concurrency tasks.

This guide delves into the top Python HTTP clients, exploring their features, pros, cons, and providing code examples to get started.

· 11 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 4 - Avoid Getting Blocked

In Part 3, we focused on analyzing and cleaning the extracted data to address potential issues like missing values, inconsistencies, and outliers. To make it easier for future decision-making, we saved the cleaned data in various formats, such as CSV, databases, and S3 buckets.

In Part 4, we'll delve into strategies for bypassing common web scraping hurdles. We'll explore techniques such as using proxies, rotating user agents, and leveraging web scraping APIs to keep your scraping tasks running smoothly.

Without further ado, let’s get started!

· 22 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 3 - Storing Data

In Part 2, we talked about creating a web scraper with Playwright to extract data from the Nike website, which has dynamically loaded content.

In Part 3, we will focus on carefully analyzing the extracted data and ensuring it's properly cleaned to deal with potential issues like missing values, inconsistencies, and outliers. The cleaned data will then be stored in different formats such as CSV, databases, and S3 buckets to make it easier for future decision-making.

Without further ado, let’s get started!

· 16 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 2 - Building a Scraper

In Part 1, you learned about the basics of Playwright, environment setup, browser launching, and taking screenshots.

In Part 2, you’ll learn how to build a scraper from scratch. We'll cover how to locate and extract data, manage dynamically loaded content, utilize Playwright's network event feature, and improve the scraper's performance by blocking unnecessary resources.

Without further ado, let’s get started!

· 7 min read
Satyam Tripathi

Web Scraping with Playwright Series Part 1 - Getting Started

Introducing the 4-Part Series on Web Scraping with Playwright! This comprehensive series will delve into web scraping using Playwright, a powerful and versatile tool for automating browser interactions.

By the end of this series, you'll have a solid understanding of web scraping with Playwright. You'll be able to build robust scrapers that can handle dynamic content, efficiently store data, and navigate through anti-scraping mechanisms.

In Part 1, you'll learn about the basics of Playwright, why it's useful, how to set up the environment, how to launch the browser using Playwright, and how to take screenshots.

· 17 min read
Oleg Kulyk

Requests vs. HTTPX - A Detailed Comparison

In the realm of Python development, making HTTP requests is a frequent task that requires efficient and reliable libraries. Two prominent libraries, Requests and HTTPX, have been widely adopted by developers for this purpose. Each library has its strengths and weaknesses, making the choice between them dependent on the specific requirements of the project. This research aims to provide a comprehensive comparison between Requests and HTTPX, considering various aspects such as asynchronous support, HTTP/2 compatibility, connection management, error handling, and performance metrics.

Requests, a well-established library, is celebrated for its simplicity and ease of use. It is often the go-to choice for developers who need to make straightforward, synchronous HTTP requests. However, its lack of native support for asynchronous operations and HTTP/2 can be a limitation for high-concurrency applications. On the other hand, HTTPX, a newer library, offers advanced features such as asynchronous support and HTTP/2, making it a more powerful tool for performance-critical applications.

This research will delve into the key feature comparisons and performance metrics of both libraries, providing detailed code examples and explanations. By examining these factors, developers can make an informed decision on which library best suits their needs. This comparison is supported by various benchmarks and source.

· 11 min read
Oleg Kulyk

BeautifulSoup Cheatsheet with Code Samples

BeautifulSoup is a powerful Python library that simplifies the process of web scraping and HTML parsing, making it an essential tool for anyone looking to extract data from web pages. The library allows users to interact with HTML and XML documents in a more human-readable way, facilitating the extraction and manipulation of web data. In this report, we will delve into the core concepts and advanced features of BeautifulSoup, providing detailed code samples and explanations to ensure a comprehensive understanding of the library's capabilities. Whether you're a beginner or an experienced developer, mastering BeautifulSoup will significantly enhance your web scraping projects, making them more efficient and robust.

· 7 min read
Oleg Kulyk

The best Python HTTP clients

Python has emerged as a dominant language due to its simplicity and versatility. One crucial aspect of web development and scraping is making HTTP requests, and Python offers a rich ecosystem of libraries tailored for this purpose.

This report delves into the best Python HTTP clients, exploring their unique features and use cases. From the ubiquitous Requests library, known for its simplicity and ease of use, to the modern and asynchronous HTTPX, which supports the latest protocols like HTTP/2 and WebSockets, there is a tool for every need. Additionally, libraries like aiohttp offer versatile async capabilities, making them ideal for real-time data scraping tasks.

For those requiring low-level control, urllib3 stands out with its robust and flexible features. On the other hand, Uplink provides a declarative approach to API interactions, while GRequests combines the simplicity of Requests with the power of Gevent's asynchronous capabilities. This report also highlights best practices for making HTTP requests and provides a comprehensive guide to efficient web scraping using HTTPX and ScrapingAnt. By understanding the strengths and weaknesses of each library, developers can make informed decisions and choose the best tool for their web scraping and development tasks.

· 10 min read
Oleg Kulyk

How to Ignore SSL Certificate in Python Requests Library

Handling SSL certificate errors is a common task for developers working with Python's Requests library, especially in development and testing environments. SSL certificates are crucial for ensuring secure data transmission over the internet, but there are scenarios where developers may need to bypass SSL verification temporarily. This comprehensive guide explores various methods to ignore SSL certificate errors in Python's Requests library, complete with code examples and best practices. While bypassing SSL verification can be useful in certain circumstances, it is essential to understand the security implications and adopt appropriate safeguards to mitigate risks. This guide covers disabling SSL verification globally, for specific requests, using custom SSL contexts, trusting self-signed certificates, and utilizing environment variables. Additionally, it delves into the security risks associated with ignoring SSL certificate errors and provides best practices for maintaining secure connections.

· 11 min read
Oleg Kulyk

How to Ignore SSL Certificate With Wget

GNU Wget stands out as a powerful tool for retrieving files over HTTP, HTTPS, and FTP protocols. One of its critical features is SSL certificate validation, which ensures secure connections by verifying the authenticity and validity of the SSL/TLS certificates presented by servers. However, there are scenarios where users might need to bypass SSL certificate errors, such as when dealing with self-signed certificates or misconfigured servers. This comprehensive guide delves into various methods of ignoring SSL certificate errors in Wget, from using the --no-check-certificate option to configuring custom CA certificates and employing environment variables. While these techniques offer quick fixes, they come with significant security implications, highlighting the need for a balanced approach that maintains both functionality and security. This report aims to provide an in-depth understanding of these methods, their risks, and best practices to ensure secure and efficient file retrieval using Wget.