
How to Use Requests Library with Sessions to Crawl Websites in Python

Oleg Kulyk · 15 min read

Extracting information from websites is an invaluable skill: it lets you collect vast amounts of data from the internet quickly, and automating the gathering removes the tedium and time cost of doing it manually. This process, popularly known as web scraping, is made significantly more accessible with the Python Requests library.

This guide equips you, whether you're an experienced developer or a relative newcomer, with the expertise required to use Python Requests. From basic HTTP requests to advanced tasks like authentication, form navigation, and optimization, we cover everything comprehensively, helping you to master web content extraction, web service interaction, and data collection for your projects. The web's wealth of information will be available in no time. Ready? Then, let's get started.

Python Requests Example

Let's start with a simple example of web crawling using the Python Requests library. This will help us to understand the importance of advanced techniques covered later in the article.

import requests

response = requests.get('https://example.com')
print(response.text)

What is the Python Requests Library?

The Python Requests library is a powerful tool that simplifies making HTTP requests and handling their responses. It's an essential resource for web developers and data enthusiasts, making interaction with web resources more manageable and user-friendly.

In essence, Python Requests acts as a bridge, connecting your Python code with the vast internet landscape. It allows your applications to easily communicate with web servers, request data, and retrieve information. Whether your goal is to extract data from a website, communicate with an API, or automate web-based tasks, Python Requests streamlines these processes for you.

What makes Python Requests stand out from its competitors is its user-friendly design. It abstracts the complexity of HTTP requests, sparing developers from grappling with the intricacies and hassle of the HTTP protocol. This ease of use is why the library is highly valued by both experienced web developers and individuals just starting to explore the world of web data.

How to Use Python Requests Library

Before we dive into advanced web crawling techniques, let's ensure you have the library installed on your system correctly.

Download and Install Python Requests

To install the Python Requests library, open your command prompt and run:

pip install requests

Import the Requests Module

After installing Requests, import the necessary modules to start using it in your Python code.

import requests

Python Requests Functions

Python Requests offers several functions for making various HTTP requests. These requests are the foundation for retrieving the necessary data from web servers. When you want to access a webpage or retrieve information from an API, the GET request is usually the one to reach for.

GET Requests

GET requests are designed for information retrieval and are inherently safe, meaning they don't modify data on the server. Instead, they fetch the requested data without causing any alterations. This makes GET requests ideal for tasks like browsing a website, fetching search results, or accessing publicly available data.

One of the primary components of a GET request is the URL (Uniform Resource Locator), which contains the web address of the resource you want to access. When you initiate a GET request, it tells the server, "Hey, I'd like to get this information from that location."

Here's an example of a simple GET request in Python using the Requests library:

import requests

response = requests.get('https://example.com')

In this example, we send a GET request to "https://example.com" to retrieve data from that web resource. The response object, aptly named response, holds the data sent back by the server.

GET requests are not only about fetching data but also about the context they carry. They can include query parameters in the URL to specify what exactly you're looking for. For instance, if you're searching for articles about "Python", you might send a GET request like this:

import requests

response = requests.get('https://example.com/search?q=python')
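
Rather than building the query string by hand, you can also pass a dictionary to the params argument and let Requests handle the URL encoding. A minimal sketch of the equivalent request:

import requests

# Requests builds and encodes the query string for you
response = requests.get('https://example.com/search', params={'q': 'python'})
print(response.url)  # https://example.com/search?q=python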

POST Requests

HTTP POST requests complement the GET requests we explored earlier. While GET requests are designed purely for retrieving data, POST requests are all about sending that data to web servers. They play a crucial role in web interactions involving form submissions, data updates, or any operation where you need to send information to the server.

Imagine filling out an online registration form, entering your name, email, and password. When you hit the "Submit" button, a POST request sends your data to the server for processing. This allows the server to create your account or perform the necessary actions based on your provided information. You see, it’s not that complex once it’s explained.

Here's an example of a basic POST request using Python Requests:

import requests

response = requests.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

In this example, we send a POST request to "https://example.com/login," supplying the server with a 'data' dictionary containing the username and password. The server then processes this data, typically authenticating the user.
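
When the server expects a JSON body rather than form-encoded data, Requests can send it via the json parameter, which also sets the Content-Type header to application/json for you. A short sketch with a hypothetical endpoint and payload:

import requests

# The endpoint and payload below are placeholders
response = requests.post(
    'https://example.com/api/items',
    json={'name': 'example', 'quantity': 1}
)
print(response.status_code)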

Python Response Object's Methods and Attributes

The Response object, as the name suggests, holds a wealth of information about the HTTP response that your request receives. Understanding how to access and utilize the methods and attributes of the Response object is crucial for extracting all that meaningful data you need from those web servers.

Access the Response Methods and Attributes

Now that we've introduced you to the significance of the Response object, let's delve into how to access and leverage its methods and attributes. This will empower you to interact with the data received from web servers most effectively.

Response Content: One of the most common operations you'll perform with the Response object is accessing the response content. This can be HTML text, JSON data, or any other data type. You can retrieve the content using the text attribute:

import requests

response = requests.get('https://example.com')
print(response.text)

response.text now holds the HTML content of the webpage fetched from "https://example.com". You can process this content, extract information, or save it to a file for future reference.

Response Status Code: HTTP status codes provide essential information about the outcome of your request. You can access the status code using the status_code attribute:

import requests

response = requests.get('https://example.com')
print(response.status_code)

response.status_code contains the HTTP status code of the response. For example, a status code of 200 indicates a successful request, while 404 signifies that the requested resource was not found.

Response Headers: HTTP headers carry metadata about the response. You can access them as a dictionary using the headers attribute:

import requests

response = requests.get('https://example.com')
print(response.headers)

response.headers behaves like a dictionary of HTTP headers. These headers often contain information such as the content type, server type, and response date, which can be valuable for understanding the server's behavior.

Response Cookies: Sometimes, web servers use cookies to store session information. You can access and manage these cookies using the cookies attribute:

import requests

response = requests.get('https://example.com')
print(response.cookies)

Feel free to explore all other attributes and methods of the Response object. They can be a valuable resource for understanding the response and extracting the data you need.
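
For example, a few other commonly used Response attributes (a quick sketch):

import requests

response = requests.get('https://example.com')

print(response.url)       # final URL after any redirects
print(response.encoding)  # encoding used to decode response.text
print(response.elapsed)   # time between sending the request and receiving the response
print(response.history)   # list of redirect responses, if any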

Process the Python Response

Once you've obtained the Response object and accessed its attributes, the next crucial step is to effectively process the content it contains. This processing varies depending on whether the response content is in text or JSON format.

Process Text Response

When dealing with textual data, such as HTML from a webpage, processing involves parsing, extracting, and possibly performing transformations. Python offers libraries like BeautifulSoup and LXML for parsing HTML or XML content. You can then use these libraries to navigate the HTML structure and extract the necessary information.

Here's an example of how to process HTML content using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the page title
title = soup.find('title')
print(title.text)

# Extract and print all paragraphs
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

First, we retrieve the HTML content from "https://example.com". We then parse it using BeautifulSoup, which makes it easy to extract specific elements such as the page title and paragraphs.

Processing JSON Content

When working with JSON data, Python provides built-in support for decoding it into native data structures using the .json() method of the Response object. This method parses the JSON content and returns a Python dictionary.

Here's an example of how to process JSON content:

import requests

response = requests.get('https://example.com/data.json')

data = response.json()
print(data['key'])

How to Access the JSON of Python Requests

Follow these steps to extract JSON data effectively:

  • Send a GET request to the API or web service that returns JSON data.
  • Receive the response and store it in a variable, such as response.
  • Use the .json() method on the Response object to decode the JSON content into a Python dictionary.
  • Access specific data within the dictionary using keys, allowing you to retrieve the necessary information.
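
Putting those steps together, here's a minimal sketch; the URL and keys are placeholders for whatever API you are working with:

import requests

response = requests.get('https://example.com/api/users.json')

if response.status_code == 200:
    data = response.json()              # decode the JSON body into Python objects
    for user in data.get('users', []):  # hypothetical 'users' list in the payload
        print(user['name'])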

How to Show the Status Code of a Python Request

To retrieve and display the status code in Python, follow these steps:

  • Send an HTTP request, such as a GET request, using the Requests library.
  • Capture the response in a variable, typically named response.
  • Access the status code using the status_code attribute of the Response object.
  • Display or process the status code as needed for your application.

Here's an example of how to retrieve and display the status code:

import requests

response = requests.get('https://example.com')

print(response.status_code)

How to Get the Main SEO Tags from a Webpage

Here's how you can retrieve and analyze the primary SEO tags using Python Requests and BeautifulSoup:

  • Send a GET request to the webpage URL you want to analyze.
  • Capture the response in a variable, typically named response.
  • Access the HTML content of the response using the text attribute of the Response object.
  • Parse the HTML content with BeautifulSoup, specifying the parser type (e.g., "html.parser" or "lxml").
  • Use BeautifulSoup methods to locate and extract the main SEO tags, such as <title>, <meta>, and <h1>.

Here's a simplified example illustrating how to retrieve the main SEO tags:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')

soup = BeautifulSoup(response.text, 'html.parser')

title = soup.find('title')
print(title.text)

meta = soup.find('meta', {'name': 'description'})
print(meta['content'])

h1 = soup.find('h1')
print(h1.text)

How to Get All the Links from a Webpage

Gathering all the links on a webpage is a common task in web scraping and data collection. Python Requests and BeautifulSoup make this process simple and straightforward. Here's how you can extract all the links from a webpage using these tools:

  • Send a GET request to the URL of the webpage you want to extract links from.
  • Capture the response in a variable, typically named response.
  • Access the HTML content of the response using the text attribute of the Response object.
  • Parse the HTML content with BeautifulSoup, specifying the parser type (e.g., "html.parser" or "lxml").
  • Use BeautifulSoup methods to locate and extract all the <a> (anchor) tags, which contain the links.

Here's a basic example demonstrating how to extract all links from a webpage:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')

soup = BeautifulSoup(response.text, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link['href'])

In this example, we send a GET request to a webpage, retrieve the HTML content, and parse it with BeautifulSoup. We then find all the <a> tags and print each link's href attribute. It's really that simple.

How to Handle Exception Errors with Python Requests

Here are some common exceptions that may occur during requests and techniques for error handling:

  • Request Exceptions: These include exceptions like requests.exceptions.RequestException, which can occur due to network issues or timeouts. You can handle them using try-except blocks and take appropriate actions, like retrying the request or notifying the user.
import requests

try:
    response = requests.get('https://example.com')
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(e)
  • HTTP Errors: HTTP errors (e.g., 404 Not Found, 500 Internal Server Error) can be checked using the response.status_code. You can raise exceptions or handle them based on your application's needs.
import requests

response = requests.get('https://example.com')

if response.status_code == 404:
    print('Page not found')
elif response.status_code == 500:
    print('Internal server error')
  • Connection Errors: These errors can occur when the client cannot connect to the server. Handling them involves catching specific exceptions like requests.exceptions.ConnectionError and deciding on appropriate courses of action.
import requests

try:
    response = requests.get('https://example.com')
except requests.exceptions.ConnectionError as e:
    print(e)

How to Change User-Agent in Your Python Request

To change the User-Agent header in your Python Requests, you can set a custom user-agent string in the headers of your request:

import requests

response = requests.get('https://example.com', headers={'User-Agent': 'Mozilla/5.0'})

In this example, we define a custom User-Agent header in the headers dictionary and include it when sending the request. This allows you to specify the user-agent string to be used for that particular request.

How to Add Timeouts to a Request in Python

Timeouts are essential when making HTTP requests to ensure that your application does not hang indefinitely while waiting for a response. They allow you to set a maximum time for how long a request should wait for a response before raising an exception. Adding timeouts is crucial to prevent your application from becoming unresponsive in case of network issues or unresponsive servers. So don’t forget them if you want an effective outcome.

To set a timeout for a request in Python Requests, use the timeout parameter when sending the request. The timeout value is specified in seconds:

import requests

try:
    response = requests.get('https://example.com', timeout=5)
except requests.exceptions.Timeout as e:
    print(e)
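
If you need finer control, the timeout parameter also accepts a (connect, read) tuple, giving the connection attempt and the response read separate limits:

import requests

try:
    # 3 seconds to establish the connection, 10 seconds to wait for a response
    response = requests.get('https://example.com', timeout=(3, 10))
except requests.exceptions.Timeout as e:
    print(e)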

How to use Proxies with Python Requests

Moving on to proxies. To use proxies with Python Requests, set the proxy configuration in the proxies parameter when sending a request. Here's an example:

import requests

proxies = {
    'http': 'http://user:pass@host:port',
    'https': 'https://user:pass@host:port'
}

response = requests.get('https://example.com', proxies=proxies)

Check out our in-depth guide about how to use proxies with Python Requests for more information.

Python Requests Sessions

Sessions in Python Requests are a useful feature for maintaining state between multiple HTTP requests. They allow you to persist certain parameters, such as cookies and headers, across multiple requests within the same session.

How to use the Requests Session Feature (Example):

import requests

session = requests.Session()

# Cookies set by the first response are sent automatically with later requests
session.get('https://example.com')
session.get('https://example.com')

print(session.cookies)

This way, you can send multiple requests within the same session, and the cookies will be persisted across all requests. This is useful for web crawling and web scraping, where you need to maintain state across multiple requests.
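
Sessions can also carry default settings. Headers and credentials attached to the session are applied to every request made through it, so you only configure them once. A short sketch (the credentials and URL are placeholders):

import requests

session = requests.Session()

# Defaults applied to every request made through this session
session.headers.update({'User-Agent': 'Mozilla/5.0'})
session.auth = ('user', 'pass')  # placeholder credentials

response = session.get('https://example.com/profile')
print(response.status_code)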

How to Retry Failed Python Requests

If your Python Requests call fails, here's what to do:

  • Use Exception Handling: Surround your request code with try-except blocks to catch exceptions like requests.exceptions.RequestException. This allows you to detect any failed requests.
  • Implement Retries: Use libraries like retrying or custom retry logic to automatically retry failed requests with back-off strategies. This helps avoid overloading servers with repeated requests.
  • Set Maximum Retry Attempts: Define a maximum number of retry attempts to prevent infinite retries and give up if a request keeps failing. You don't want to waste your time.

Here's an example of implementing request retries using the retrying library:

import requests
from retrying import retry

@retry(stop_max_attempt_number=3)
def get(url):
    response = requests.get(url)
    response.raise_for_status()
    return response

try:
    response = get('https://example.com')
except requests.exceptions.RequestException as e:
    print(e)
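
Alternatively, if you are already working with a Session, you can mount an HTTPAdapter configured with urllib3's Retry class, which retries failed requests at the transport level with exponential back-off. A sketch of that approach:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry up to 3 times on common transient server errors, with back-off between attempts
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

response = session.get('https://example.com')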

Other HTTP Methods in the Requests Module

Here are some other HTTP methods supported by Requests:

  • PUT: Used to update or replace an existing resource on the server.

  • DELETE: Used to delete a resource on the server.

  • HEAD: Similar to GET but retrieves only the headers, not the content, which can be useful for checking resource existence.

  • OPTIONS: Used to retrieve information about the communication options for a resource.

  • PATCH: Used to apply partial modifications to a resource.
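
Each of these methods has a corresponding helper function in Requests. A quick sketch against a placeholder resource URL:

import requests

url = 'https://example.com/api/items/1'  # placeholder resource

requests.put(url, json={'name': 'updated item'})  # replace the resource
requests.patch(url, json={'name': 'new name'})    # partially update the resource
requests.delete(url)                              # delete the resource
requests.head(url)                                # headers only, no body
requests.options(url)                             # available communication options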

Python Requests Best Practices

To ensure efficient and effective web crawling and API interactions, here are some of the best practices and tips for using Python Requests:

  • Use Sessions: Utilize Requests sessions to persist headers, cookies, and session data across multiple requests.
  • Implement Rate Limiting: Respect rate limits imposed by websites or APIs to avoid being blocked or banned.
  • Error Handling: Implement robust error handling to handle exceptions gracefully, including retries for failed requests.
  • Use Custom Headers: When necessary, set custom headers like User-Agent to mimic different clients or devices.
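
To illustrate how these practices fit together, here's a minimal sketch combining a session, a custom User-Agent, basic error handling, and simple rate limiting with time.sleep; the URLs and delay are placeholders you would tune for the site you're crawling:

import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # custom header

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

for url in urls:
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        print(url, response.status_code)
    except requests.exceptions.RequestException as e:
        print(url, e)
    time.sleep(1)  # simple rate limiting between requests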

Conclusion

Python's Requests library is an indispensable tool for web crawling in Python. With the knowledge you have gained from this guide, you can efficiently fetch data from websites, process responses, handle exceptions, and retry failed requests. It's all there for you to utilize.

Remember to follow the best practices and never stop exploring all the various functions and methods at your disposal. Armed with this knowledge, you are well-equipped to tackle web crawling tasks in Python confidently. So what are you waiting for?

Happy web crawling, and don't forget to use the latest library version to get the most out of your web scraping projects 🔝

Forget about getting blocked while scraping the Web

Try out ScrapingAnt Web Scraping API with thousands of proxy servers and an entire headless Chrome cluster