
199 posts tagged with "web scraping"


Satyam Tripathi · 22 min read

Web Scraping with Playwright Series Part 3 - Storing Data

In Part 2, we talked about creating a web scraper with Playwright to extract data from the Nike website, which has dynamically loaded content.

In Part 3, we will focus on carefully analyzing the extracted data and cleaning it to deal with potential issues like missing values, inconsistencies, and outliers. The cleaned data will then be stored in several destinations, such as CSV files, databases, and S3 buckets, so it's readily available for future decision-making.

Without further ado, let’s get started!
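As a rough preview of the storage step, here is a minimal sketch (not the article's exact code) that cleans a small list of scraped product records with pandas and writes them to a CSV file. The field names and file path are illustrative assumptions.

```python
import pandas as pd

# Hypothetical records scraped in Part 2; the field names are illustrative.
products = [
    {"name": "Air Max 90", "price": "129.99", "category": "Shoes"},
    {"name": "Dri-FIT Tee", "price": None, "category": "Shirts"},
]

df = pd.DataFrame(products)

# Basic cleaning: drop rows with missing prices and cast prices to float.
df = df.dropna(subset=["price"])
df["price"] = df["price"].astype(float)

# Store the cleaned data as CSV for later analysis.
df.to_csv("nike_products.csv", index=False)
```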

Satyam Tripathi · 16 min read

Web Scraping with Playwright Series Part 2 - Building a Scraper

In Part 1, you learned about the basics of Playwright, environment setup, browser launching, and taking screenshots.

In Part 2, you’ll learn how to build a scraper from scratch. We'll cover how to locate and extract data, manage dynamically loaded content, utilize Playwright's network event feature, and improve the scraper's performance by blocking unnecessary resources.

Without further ado, let’s get started!
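To give a flavor of the techniques covered, the sketch below blocks image requests with Playwright's routing API and reads text from the loaded elements. It is illustrative only: the URL and selector are assumptions, not the article's exact code.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Block images to speed up scraping; every other resource passes through.
    page.route(
        "**/*",
        lambda route: route.abort()
        if route.request.resource_type == "image"
        else route.continue_(),
    )

    page.goto("https://www.nike.com/w/mens-shoes")   # illustrative URL
    page.wait_for_selector(".product-card__title")   # illustrative selector

    titles = page.locator(".product-card__title").all_inner_texts()
    print(titles[:5])

    browser.close()
```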

Satyam Tripathi · 7 min read

Web Scraping with Playwright Series Part 1 - Getting Started

Introducing the 4-Part Series on Web Scraping with Playwright! This comprehensive series will delve into web scraping using Playwright, a powerful and versatile tool for automating browser interactions.

By the end of this series, you'll have a solid understanding of web scraping with Playwright. You'll be able to build robust scrapers that can handle dynamic content, efficiently store data, and navigate through anti-scraping mechanisms.

In Part 1, you'll learn the basics of Playwright: why it's useful, how to set up the environment, how to launch a browser, and how to take screenshots.
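As a small taste of what Part 1 covers, here is a minimal sketch that launches headless Chromium with Playwright's sync API and saves a screenshot; the target URL is just an example.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium browser.
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and capture a full-page screenshot.
    page.goto("https://example.com")
    page.screenshot(path="example.png", full_page=True)

    browser.close()
```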

Oleg Kulyk · 17 min read

Requests vs. HTTPX - A Detailed Comparison

In the realm of Python development, making HTTP requests is a frequent task that requires efficient and reliable libraries. Two prominent libraries, Requests and HTTPX, have been widely adopted by developers for this purpose. Each library has its strengths and weaknesses, making the choice between them dependent on the specific requirements of the project. This research aims to provide a comprehensive comparison between Requests and HTTPX, considering various aspects such as asynchronous support, HTTP/2 compatibility, connection management, error handling, and performance metrics.

Requests, a well-established library, is celebrated for its simplicity and ease of use. It is often the go-to choice for developers who need to make straightforward, synchronous HTTP requests. However, its lack of native support for asynchronous operations and HTTP/2 can be a limitation for high-concurrency applications. On the other hand, HTTPX, a newer library, offers advanced features such as asynchronous support and HTTP/2, making it a more powerful tool for performance-critical applications.

This research will delve into the key feature comparisons and performance metrics of both libraries, providing detailed code examples and explanations. By examining these factors, developers can make an informed decision on which library best suits their needs. This comparison is supported by various benchmarks and sources.
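As a minimal illustration of the difference (not taken from the benchmarks themselves), the sketch below makes the same GET request synchronously with Requests and asynchronously with HTTPX. HTTP/2 support in HTTPX requires the optional `h2` dependency (`pip install httpx[http2]`).

```python
import asyncio

import httpx
import requests

# Synchronous request with Requests.
resp = requests.get("https://httpbin.org/get")
print(resp.status_code)

# Asynchronous request with HTTPX (http2=True needs `pip install httpx[http2]`).
async def fetch():
    async with httpx.AsyncClient(http2=True) as client:
        r = await client.get("https://httpbin.org/get")
        print(r.status_code, r.http_version)

asyncio.run(fetch())
```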

Oleg Kulyk · 11 min read

BeautifulSoup Cheatsheet with Code Samples

BeautifulSoup is a powerful Python library that simplifies the process of web scraping and HTML parsing, making it an essential tool for anyone looking to extract data from web pages. The library allows users to interact with HTML and XML documents in a more human-readable way, facilitating the extraction and manipulation of web data. In this report, we will delve into the core concepts and advanced features of BeautifulSoup, providing detailed code samples and explanations to ensure a comprehensive understanding of the library's capabilities. Whether you're a beginner or an experienced developer, mastering BeautifulSoup will significantly enhance your web scraping projects, making them more efficient and robust.
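A minimal sketch of the kind of parsing the cheatsheet walks through; the HTML snippet is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Example page</h1>
  <ul>
    <li class="item">First</li>
    <li class="item">Second</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the heading text and all list items with class "item".
print(soup.h1.get_text())
items = [li.get_text() for li in soup.find_all("li", class_="item")]
print(items)  # ['First', 'Second']
```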

Oleg Kulyk · 7 min read

The best Python HTTP clients

Python has emerged as a dominant language for web development and web scraping thanks to its simplicity and versatility. One crucial aspect of both tasks is making HTTP requests, and Python offers a rich ecosystem of libraries tailored for this purpose.

This report delves into the best Python HTTP clients, exploring their unique features and use cases. From the ubiquitous Requests library, known for its simplicity and ease of use, to the modern and asynchronous HTTPX, which supports the latest protocols like HTTP/2 and WebSockets, there is a tool for every need. Additionally, libraries like aiohttp offer versatile async capabilities, making them ideal for real-time data scraping tasks.

For those requiring low-level control, urllib3 stands out with its robust and flexible features. On the other hand, Uplink provides a declarative approach to API interactions, while GRequests combines the simplicity of Requests with the power of Gevent's asynchronous capabilities. This report also highlights best practices for making HTTP requests and provides a comprehensive guide to efficient web scraping using HTTPX and ScrapingAnt. By understanding the strengths and weaknesses of each library, developers can make informed decisions and choose the best tool for their web scraping and development tasks.
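As one concrete illustration of the asynchronous style mentioned above, here is a hedged sketch that fetches several pages concurrently with aiohttp; the URLs are placeholders.

```python
import asyncio

import aiohttp

URLS = ["https://example.com", "https://httpbin.org/get"]  # placeholder URLs

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, url) for url in URLS))
        for url, status in results:
            print(url, status)

asyncio.run(main())
```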

Oleg Kulyk · 10 min read

How to Ignore SSL Certificate in Python Requests Library

Handling SSL certificate errors is a common task for developers working with Python's Requests library, especially in development and testing environments. SSL certificates are crucial for ensuring secure data transmission over the internet, but there are scenarios where developers may need to bypass SSL verification temporarily. This comprehensive guide explores various methods to ignore SSL certificate errors in Python's Requests library, complete with code examples and best practices. While bypassing SSL verification can be useful in certain circumstances, it is essential to understand the security implications and adopt appropriate safeguards to mitigate risks. This guide covers disabling SSL verification globally, for specific requests, using custom SSL contexts, trusting self-signed certificates, and utilizing environment variables. Additionally, it delves into the security risks associated with ignoring SSL certificate errors and provides best practices for maintaining secure connections.
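The simplest of these methods, sketched below, disables verification for a single request and silences the resulting warning; use it only against hosts you trust, such as a local test server (the URL here is a public test host with a self-signed certificate).

```python
import requests
import urllib3

# Suppress the InsecureRequestWarning that verify=False would otherwise trigger.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# verify=False skips SSL certificate validation for this request only.
resp = requests.get("https://self-signed.badssl.com/", verify=False)
print(resp.status_code)
```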

Oleg Kulyk · 11 min read

How to Ignore SSL Certificate With Wget

GNU Wget stands out as a powerful tool for retrieving files over HTTP, HTTPS, and FTP protocols. One of its critical features is SSL certificate validation, which ensures secure connections by verifying the authenticity and validity of the SSL/TLS certificates presented by servers. However, there are scenarios where users might need to bypass SSL certificate errors, such as when dealing with self-signed certificates or misconfigured servers. This comprehensive guide delves into various methods of ignoring SSL certificate errors in Wget, from using the --no-check-certificate option to configuring custom CA certificates and employing environment variables. While these techniques offer quick fixes, they come with significant security implications, highlighting the need for a balanced approach that maintains both functionality and security. This report aims to provide an in-depth understanding of these methods, their risks, and best practices to ensure secure and efficient file retrieval using Wget.
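The quickest of these fixes is the --no-check-certificate flag. As a hedged sketch, the snippet below simply wraps that command line from Python (it assumes wget is installed and uses a public test host with a self-signed certificate).

```python
import subprocess

# Equivalent to running: wget --no-check-certificate <url>
# --no-check-certificate tells wget to skip SSL certificate validation.
url = "https://self-signed.badssl.com/"  # example host with a self-signed cert
subprocess.run(["wget", "--no-check-certificate", url], check=True)
```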

Oleg Kulyk · 16 min read

Scrape a Dynamic Website with C++

Web scraping has become an indispensable tool for acquiring data from websites, especially in the era of big data and data-driven decision-making. However, the complexity of scraping has increased with the advent of dynamic websites, which generate content on-the-fly using JavaScript and AJAX. Unlike static websites, which serve pre-built HTML pages, dynamic websites respond to user interactions and real-time data updates, making traditional scraping techniques ineffective.

To navigate this landscape, developers need to understand the intricacies of client-side and server-side rendering, the role of JavaScript frameworks such as React, Angular, and Vue.js, and the importance of AJAX for asynchronous data loading. This knowledge is crucial for choosing the right tools and techniques to effectively scrape dynamic websites. In this report, we delve into the methodologies for scraping dynamic websites using C++, exploring essential libraries like libcurl, Gumbo, and Boost, and providing a detailed, step-by-step guide to building robust web scrapers.

Oleg Kulyk · 16 min read

Scrape a Dynamic Website with C#

Dynamic websites have become increasingly prevalent due to their ability to deliver personalized and interactive content to users. Unlike static websites, which serve pre-built HTML pages, dynamic websites generate content on-the-fly based on user interactions, database queries, or real-time data. This dynamic nature is achieved through the use of server-side programming languages such as PHP, Ruby, and Python, as well as client-side JavaScript frameworks like React, Angular, and Vue.js.

Dynamic websites are characterized by asynchronous content loading, client-side rendering, real-time updates, personalized content, and complex DOM structures. These features enhance user experience but also introduce significant challenges for web scraping. Traditional scraping tools that rely on static HTML parsing often fall short when dealing with dynamic websites, necessitating the use of more sophisticated methods and tools.

To effectively scrape dynamic websites using C#, developers must employ specialized tools such as Selenium WebDriver and PuppeteerSharp, which can interact with web pages as if they were real users, executing JavaScript and waiting for content to load. These tools, along with proper wait mechanisms and dynamic element location strategies, enable the extraction of data from even the most complex and interactive web applications.