
How to Create a Proxy Server in Python Using Proxy.py

Oleg Kulyk · 15 min read

You can be in one of two groups of web developers:

  1. Developers who get blocked when web scraping
  2. Developers who use proxy servers to hide their IP and easily extract the data they want

If you’re in group 2, you make it harder for websites and services to track your online activity. You can bypass regional restrictions and access content that might otherwise be unavailable. You can even filter and inspect incoming and outgoing traffic to protect against malicious requests or unauthorized access attempts.

In this article, we’ll explain how to use the proxy.py library so you can firmly place yourself in group 2. Let’s not waste any more time and get straight to it.

Understanding the Python proxy.py Library

proxy.py is an open-source, Python-based proxy server framework: a library that provides the tools and infrastructure to build and configure proxy servers with specific functionalities.

The capabilities of proxy.py include:

  • HTTP/HTTPS proxying: proxy.py can proxy HTTP and HTTPS requests.
  • SOCKS4/SOCKS5 proxying: it can also proxy SOCKS4 and SOCKS5 requests.
  • SSL interception and inspection: developers can use proxy.py to intercept and inspect SSL traffic.
  • Traffic filtering: all traffic passing through the proxy server can be filtered and modified.
  • Traffic recording: all traffic passing through the proxy server can be recorded.

Our journey through this article will take us from the foundational aspects of proxy servers to their advanced use cases. We’ll also cover how to create a proxy server in Python using proxy.py. Whether you are looking to maintain anonymity, manage multiple requests, or circumvent geo-restrictions, understanding how to create and use a proxy server is an invaluable skill in your data extraction toolkit.

What Is a Proxy Server?

A proxy server is a computer system or application that acts as an intermediary, serving as a gateway between a user's computer and the internet.

When you use a proxy server, your internet traffic flows through it on the way to the address you requested. The response then comes back through the same proxy server, which forwards the data received from the website to you. This process ensures that your IP address is never exposed directly to the target server, offering a level of anonymity and security.

Importance of Proxy Servers in Web Scraping

In the world of web scraping, proxy servers are essential tools. They hide the IP address of the scraper and make requests appear to come from a different location.

The main reasons why proxy servers are important in web scraping include:

  • Anonymity: They mask your IP address, making it difficult for target websites to determine the origin of the scrape request. This is crucial for data extraction specialists who need to gather data without revealing their identity or location.
  • Avoiding IP Bans and Rate Limits: Websites often track the IP addresses of visitors and may block those that appear to make unusually high numbers of requests. By using different proxy servers, web scrapers can avoid being detected and banned by these websites.
  • Geo-Specific Data Extraction: Proxy servers can also be used to access content that is geo-restricted. By routing your requests through a proxy server located in a specific geographical region, you can access and scrape content that is otherwise unavailable in your location.
  • Balancing Load and Reducing Latency: Using multiple proxy servers can distribute the load of your scraping activities, reducing the risk of overloading any single server and potentially reducing latency.

In terms of anonymity, there are three types of proxy servers:

  • Transparent Proxy: These proxies pass along your IP address to the web server. While they don't provide anonymity, they are useful for content caching, controlling internet usage, and overcoming simple IP bans.
  • Anonymous Proxy: This type of proxy does not transmit your IP address to the target server. It provides a significant level of anonymity, making it a popular choice for web scraping, especially when gathering publicly accessible data without revealing the scraper's identity.
  • High Anonymity (Elite) Proxy: High anonymity proxies offer the highest level of privacy and security. They not only hide your IP address from the target server but also do not identify themselves as proxies. These are particularly useful for highly sensitive scraping tasks where maximum anonymity is a priority.

This is quite a brief overview of proxy servers. If you want to learn more about them, check out our guide to proxy servers.

Getting Started with Proxy.py

Installation of Proxy.py

To begin using proxy.py, you first need to install it. This can be easily done using Python's package manager, pip. Open your terminal or command prompt and run the following command:

pip install proxy.py

Still, other installation options are available; you can find them in the Installation section of the official GitHub page.

Basic Setup and Configuration

Once proxy.py is installed, you can start configuring it for basic usage. To start the proxy server with default settings, simply run:

proxy

This command starts a proxy server on your local machine. By default, the server runs on 127.0.0.1 (localhost) and port 8899. You can change these defaults by specifying the IP address and port as command-line arguments. For example:

proxy --hostname 0.0.0.0 --port 9000

Creating a Simple Proxy Server

Now that you have proxy.py running, you can route a web scraper through it. Here's a simple example to demonstrate this:

  • Configure Your Web Scraper: In your web scraping script, configure the HTTP or HTTPS requests to route through the proxy server. For instance, if you're using Python's requests library, you can do this by:
import requests

proxies = {
    'http': 'http://127.0.0.1:8899',
    'https': 'http://127.0.0.1:8899',
}
response = requests.get('http://example.com', proxies=proxies)
print(response.text)

This script routes the request through the proxy server running on localhost at port 8899.

You can learn more about requests library and how to use it in our guide to web scraping with Python Requests.

  • Testing and Verification: After setting up your web scraper to use the proxy server, run your script. If everything is set up correctly, your web scraper should successfully fetch data via the proxy server. You can verify this by checking the logs of proxy.py, where you should see the incoming requests; a quick command-line check is shown below.
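If you have curl installed, you can also test the proxy directly from the command line (assuming the default 127.0.0.1:8899 address; httpbin.org is simply a convenient echo service):

curl -x http://127.0.0.1:8899 http://httpbin.org/get

A successful JSON response, together with a new entry in proxy.py's log output, confirms that traffic is flowing through the proxy.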

This setup is sufficient for simple web scraping tasks. However, for more advanced requirements, such as handling SSL requests or setting up rotating proxies, you will need to delve into more complex configurations and perhaps write custom plugins, which proxy.py robustly supports.

Advanced Proxy Server Configuration

proxy.py is a highly configurable library that can be customized to suit your specific needs. These advanced features make it ideal for complex web scraping tasks. Let's delve into some of these features.

  • Threading: proxy.py can handle multiple requests simultaneously thanks to its threading capabilities. By default, it operates in a non-blocking, event-driven I/O model, which is efficient for handling high concurrency. However, you can enable threading if needed, using the --threaded flag when starting the proxy. This is particularly useful when you need to handle a large number of simultaneous connections.
  • SSL Interception: For web scraping HTTPS sites, SSL interception is a crucial feature. proxy.py can be configured to decrypt SSL traffic. To enable SSL interception, you need to generate and provide the path to a CA (Certificate Authority) certificate file and a private key file (see the example commands after this list). This allows the proxy to decrypt, inspect, and re-encrypt SSL traffic, which is vital for scraping HTTPS websites.
  • Plugin System: One of the most powerful features of proxy.py is its plugin-based architecture. Plugins allow you to extend the functionality of the proxy server. You can write custom plugins for tasks like modifying request headers, logging, request filtering, or even building a custom caching layer. Plugins are Python classes that can hook into various stages of the request/response cycle.
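For example, threading is enabled with a single flag, and SSL interception is configured by pointing the server at your CA files. The flag names below are taken from proxy.py's documentation, and the .pem file names are placeholders for certificates you have generated yourself; exact options may vary between versions:

proxy --threaded

proxy --ca-key-file ca-key.pem --ca-cert-file ca-cert.pem --ca-signing-key-file ca-signing-key.pem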

To use a plugin, you simply need to pass it as a command-line argument while starting the proxy server. For example:

proxy --plugins plugin_module.PluginClassName
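As a minimal illustration of the plugin system, here is a sketch of a plugin that tags every proxied request with an extra header. It is based on proxy.py's documented HttpProxyBasePlugin interface, but treat it as a sketch: hook names and required methods can differ between versions, and the module name my_plugins is hypothetical.

from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin

class AddHeaderPlugin(HttpProxyBasePlugin):
    """Sketch: add a custom header to every request passing through the proxy."""

    def before_upstream_connection(self, request: HttpParser) -> Optional[HttpParser]:
        # Returning the request unchanged lets the connection proceed as usual.
        return request

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        # proxy.py works with bytes, so header names and values are bytes too.
        request.add_header(b'X-Scraper-Tag', b'my-scraper')
        return request

    def handle_upstream_chunk(self, chunk):
        # Pass response data through untouched.
        return chunk

    def on_upstream_connection_close(self) -> None:
        pass

Assuming the class lives in my_plugins.py on your PYTHONPATH, you would start the server with proxy --plugins my_plugins.AddHeaderPlugin.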

Customization is where proxy.py truly shines, especially for complex web scraping tasks. Here’s how you can customize it:

  • Custom Code: Write plugins to handle specific tasks, like rotating IPs, modifying user agents, handling CAPTCHAs, or managing sessions. This level of customization allows you to tailor the proxy server to your specific scraping needs.
  • Handling Complex Scenarios: For instance, if you need to scrape a website that has rate-limiting, you could write a plugin to rotate IP addresses or add delays between requests.
  • Performance Optimization: You can optimize the performance of proxy.py by tweaking its configuration parameters, such as the number of threads, enabling or disabling certain plugins based on the task, and adjusting timeout settings.
  • Security Enhancements: If you are dealing with sensitive data, you can enhance the security of your proxy server by implementing encryption plugins or plugins that anonymize data.

In summary, the advanced features of proxy.py make it a highly flexible tool for web scraping professionals. Its ability to handle high concurrency, support for SSL interception, and a robust plugin system allow it to be adapted for a wide range of scraping scenarios, from simple data collection tasks to complex and large-scale data extraction projects.

One such complex scenario is web scraping with rotating proxies. Let's take a look at how to do it.

How to Make Rotating Proxies with Proxy.py

Rotating proxies are a system where each request is sent through a different proxy server, cycling through a pool of IP addresses.

To implement rotating proxies in proxy.py, you will primarily use its plugin system. Below is a simplified guide to creating a custom plugin for IP rotation:

  • Create a List of Proxy IPs: Start with a list of available proxy IP addresses and ports that you will rotate through. These could be your own proxy.py instances running on remote servers, or proxies acquired from a third-party service.
  • Create a Custom Plugin: Write a custom plugin that handles the IP rotation, selecting a proxy IP from the list and routing the request through it. Here’s a skeleton of what the plugin might look like:
import random
from proxy.http.proxy import HttpProxyBasePlugin

class RotatingProxyPlugin(HttpProxyBasePlugin):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Your list of proxies
        self.proxy_list = [
            '127.0.0.1:8899',
            '127.0.0.1:8900',
            # ... other proxy IPs
        ]

    def before_upstream_connection(self, request):
        # Select a random proxy from the list
        proxy = random.choice(self.proxy_list)
        # Parse the proxy address
        proxy_host, proxy_port = proxy.split(':')
        # Set the upstream host and port for the request
        request.host = proxy_host
        request.port = int(proxy_port)
        return request

    # Depending on your proxy.py version, you may also need to provide no-op
    # implementations of the remaining plugin methods, e.g.
    # handle_client_request, handle_upstream_chunk, and
    # on_upstream_connection_close.

This implementation modifies the before_upstream_connection method to choose a random proxy from proxy_list and set the host and port of the request accordingly. The split method separates the host and port, and the port is converted to an integer, since request.port expects an integer value.

  • Run Proxy.py with Your Plugin: Start proxy.py and specify your plugin:
proxy --plugins path.to.RotatingProxyPlugin

Such a plugin is a great starting point for your own rotating proxy server. For example, you could fill the list with free proxies and use them for web scraping. Still, we encourage using high-quality resources for production purposes, as free proxies are neither safe nor reliable.

Room for Improvement

The above implementation is a simple example of how to create a rotating proxy server using proxy.py. However, it can be improved in several ways:

  • Handling Errors: The above implementation does not handle errors. For instance, if a proxy server is down, the request will fail. You can improve this by adding error handling to the plugin (a sketch follows after this list).
  • Handling Proxy Reputation for the Target Host: Some websites may block requests from certain proxy servers. You can improve the plugin by adding a mechanism to track the reputation of each proxy server for the target host and avoid using blacklisted proxies.
  • Handling Proxy Authentication: If your upstream proxy servers require authentication, you can add support for it in the plugin. The same applies to the rotating proxy server itself: to use it in the real world, you will need to add authentication for incoming clients as well.
  • Handling Proxy Rotation Frequency: You can add a mechanism to control the frequency of proxy rotation. For instance, you can set a minimum time interval between each rotation to avoid overloading the proxy servers.

Please note that this code assumes all proxies in your list are HTTP proxies. If you're using a mix of HTTP and HTTPS proxies, or if there are other specific requirements, you may need to adjust the logic accordingly. Additionally, remember to handle possible exceptions and edge cases, such as an empty proxy list or invalid proxy formats, in a production environment.
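As a starting point for the first two improvements, here is a hedged sketch of the same plugin with basic error handling: it skips entries that fail to parse, keeps a simple blacklist, and falls back to a direct connection when no usable proxy is left. The blacklisting policy here is an illustration, not a proxy.py API:

import random
from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin

class SafeRotatingProxyPlugin(HttpProxyBasePlugin):
    """Sketch: rotate proxies while skipping malformed or blacklisted entries."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_list = [
            '127.0.0.1:8899',
            '127.0.0.1:8900',
        ]
        self.blacklist = set()  # entries that failed to parse or connect

    def before_upstream_connection(self, request: HttpParser) -> Optional[HttpParser]:
        candidates = [p for p in self.proxy_list if p not in self.blacklist]
        if not candidates:
            # No usable proxies left: fall through to a direct connection.
            return request
        choice = random.choice(candidates)
        try:
            host, port = choice.split(':')
            request.host = host
            request.port = int(port)
        except ValueError:
            # Malformed entry such as 'host' or 'host:port:extra'.
            self.blacklist.add(choice)
        return request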

You can get more inspiration for writing your own plugins by studying the existing ones in the proxy.py repository; you can find them inside the proxy/plugin directory on GitHub. The examples directory also contains useful examples of using proxy.py in real-world scenarios.

Security and Ethical Considerations in Using Proxy Servers for Web Scraping

Ensure you comply with the laws and regulations when web scraping. Here are some tips to help you use proxies in scraping legally:

  • Compliance with Legal Regulations: Always ensure that your use of proxy servers complies with local, national, and international laws. This includes respecting privacy laws, data protection regulations, and specific terms of service of websites.
  • Respecting Target Websites' Terms of Service: Many websites have specific terms of service that prohibit scraping or certain types of access. It's important to review and adhere to these terms to avoid legal issues.
  • Rate Limiting: Implement rate limiting in your scraping activities (see the sketch after this list). Bombarding a website with too many requests in a short period can strain its servers, which might be considered a denial-of-service attack.
  • Data Privacy and Security: When handling sensitive data, ensure that you have robust security measures in place. This includes encrypting data in transit and at rest, and ensuring that only authorized personnel have access to it.
  • Transparency and Accountability: If you're collecting data for research or other purposes, be transparent about your methods and intentions. Ensure that your activities can withstand ethical scrutiny.
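To illustrate the rate-limiting point above, here is a minimal client-side sketch using only requests and the standard library. The URLs and delay bounds are placeholders to tune for each target site:

import random
import time

import requests

proxies = {
    'http': 'http://127.0.0.1:8899',
    'https': 'http://127.0.0.1:8899',
}
urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholders

for url in urls:
    response = requests.get(url, proxies=proxies)
    print(url, response.status_code)
    # Jittered delay so requests don't hit the target at a fixed rate.
    time.sleep(random.uniform(1.0, 3.0))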

Web scraping with proxies carries potential risks, which can nevertheless be mitigated:

  • Risk of Detection and Blocking: Even with proxies, there's a risk of being detected and blocked by target websites. To mitigate this, use a rotating pool of proxies, implement appropriate request headers, and randomize request timings.
  • Proxy Server Security: Using third-party proxy servers can pose a security risk, as the data passing through the proxy can be intercepted. To mitigate this, use trusted proxy providers, or better yet, set up your own private proxies.
  • Legal and Ethical Risks: Unauthorized scraping or bypassing access controls can lead to legal actions. Always ensure that your scraping activities are legal and ethical. If in doubt, seek legal advice.
  • Data Quality and Integrity: Data obtained through proxies, especially public ones, can be manipulated or corrupted. Always verify the integrity and accuracy of the data you scrape.
  • Impact on Target Servers: Excessive scraping can impact the performance of the target website. Be considerate of the website's resources. Implementing polite scraping practices like honoring robots.txt and using caching can help reduce the load on target servers.

In conclusion, while proxy servers are powerful tools for web scraping and data extraction, they come with responsibilities. Ethical and secure use of proxy servers not only protects you from legal repercussions but also respects the rights and resources of the target websites.

Conclusion

In this comprehensive exploration of creating and managing proxy servers using Proxy.py, we have covered a range of topics from the basics of setting up a simple proxy server to implementing advanced features like rotating proxies. We've also delved into the ethical and security considerations that are crucial in the world of web scraping.

As with any powerful tool, it's important to use proxy.py responsibly. Adhering to legal guidelines, respecting the terms of service of websites, and considering the ethical implications of your scraping activities are essential practices. Remember, the goal is to extract data in a way that is sustainable, respectful, and compliant with legal standards.

In closing, the journey through web scraping and proxy management is an ongoing learning process. Tools like Proxy.py are continuously evolving, and staying updated with the latest developments will greatly enhance your capabilities as a web scraping specialist.

Happy Web Scraping and don't forget to check out other articles at this web scraping blog 🤓

Forget about getting blocked while scraping the Web

Try out ScrapingAnt Web Scraping API with thousands of proxy servers and an entire headless Chrome cluster