A quality proxy server is key to a successful web scraper. A large pool of reliable IP addresses makes it possible to collect data from many websites without worrying about being blocked.
Still, many websites provide free proxy lists, so can the process of getting IP addresses from them be automated? Are free proxies good enough for web scraping? Let's check it out.
Why do we need a proxy server for web scraping?
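As you may know, web scraping is the process of extracting data from web pages. The data on those pages has value, so website administrators often try to protect it from automated collection. Such protection commonly works by limiting or blocking individual IP addresses, and a proxy server lets a scraper spread its requests across many addresses.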
Proxy scraping tools
Every good proxy scraping tool should perform at least two functions: continuously retrieve proxy servers and continuously check them. The first function is usually implemented by crawling a pre-defined list of free proxy websites and collecting IP/port information from them. The second is a checker: a function that iterates through the whole database of harvested proxies and tries to perform a request through each of them.
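The two functions can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation of any tool discussed here: the names extract_proxies, is_alive and check_all are hypothetical, and the choice of http://api.ipify.org as the test URL is an assumption.

```python
import http.client
import re
import urllib.request

# Matches "IP:port" pairs such as 203.0.113.7:8080 anywhere in page text.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3}):(\d{2,5})\b")


def extract_proxies(page_text):
    """Return unique, syntactically valid "ip:port" strings in order of appearance."""
    seen, result = set(), []
    for ip, port in PROXY_RE.findall(page_text):
        octets_ok = all(int(part) <= 255 for part in ip.split("."))
        port_ok = 0 < int(port) <= 65535
        candidate = f"{ip}:{port}"
        if octets_ok and port_ok and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


def is_alive(proxy, test_url="http://api.ipify.org", timeout=5):
    """Try one request through the proxy; any network error counts as dead."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False


def check_all(proxies, checker=is_alive):
    """Keep only the proxies for which the checker succeeds."""
    return [proxy for proxy in proxies if checker(proxy)]
```

In practice the checker would run concurrently and on a schedule, since free proxies tend to appear and disappear quickly; the sequential loop here only shows the idea.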
More advanced proxy scrapers may also collect meta-information about proxies, such as country, average latency, and anonymity level. Some provide proxy rotation together with HTTP proxy forwarding, so you don't need to manage a connection for each of the collected IP addresses.
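Proxy rotation itself is a simple idea. The sketch below shows one possible approach, round-robin rotation that drops proxies after repeated failures; the RotatingPool class and its max_failures parameter are hypothetical and are not taken from any of the tools below.

```python
from collections import deque


class RotatingPool:
    """Round-robin rotation over proxies, dropping those that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self._queue = deque(dict.fromkeys(proxies))  # de-duplicate, keep order
        self._failures = {}
        self._max_failures = max_failures

    def next_proxy(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._queue:
            raise LookupError("no working proxies left")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def report_failure(self, proxy):
        """Count a failed request; evict the proxy once it fails too often."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1
        if self._failures[proxy] >= self._max_failures and proxy in self._queue:
            self._queue.remove(proxy)

    def __len__(self):
        return len(self._queue)
```

A scraper would call next_proxy() before each request and report_failure() whenever a request through that proxy fails.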
Let's check out the most popular and promising open-source tools from this category.
Scylla - An Intelligent Proxy Pool
Scylla is one of my favourites among all the projects I've seen. It combines a proxy crawler, a checker, and an HTTP forward proxy in one package. Scylla is also under active development, so its contributors keep delivering new features and bug fixes.
Here is a list of the best features:
- Automatic proxy IP crawling and validation
- Easy-to-use JSON API
- Simple HTTP Forward proxy server
- Docker image support
- Scrapy and requests integration with as little as one line of code
- Headless browser crawling
The fourth point means that installing Scylla is as easy as running the following command:
docker run -d -p 8899:8899 -p 8081:8081 -v /var/www/scylla:/var/www/scylla --name scylla wildcat/scylla:latest
This simple command will run the whole Scylla infrastructure without any additional configuration.
To do the same using pip:
pip install scylla
scylla # Run the crawler and web server for JSON API
These commands start the Scylla crawler, the JSON API (on port 8899 by default), and the HTTP forward proxy (on port 8081 by default).
This means that a few minutes after starting the project (it needs some time to fill the database with working proxies), you will be able to scrape the web using free rotating proxies.
To check out the proxy rotation function, simply run the following command:
curl http://api.ipify.org -x http://127.0.0.1:8081
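The same check can be made from Python. The following is a minimal sketch using only the standard library, assuming the forward proxy is listening on 127.0.0.1:8081 as in the curl command above; the function names are illustrative and are not part of Scylla.

```python
import urllib.request

SCYLLA_PROXY = "http://127.0.0.1:8081"  # default forward proxy address


def build_opener_via_proxy(proxy_url=SCYLLA_PROXY):
    """Create an opener that sends plain HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_external_ip(opener, timeout=10):
    """Ask api.ipify.org which IP address the request appeared to come from."""
    with opener.open("http://api.ipify.org", timeout=timeout) as response:
        return response.read().decode().strip()


# Example usage (requires a running Scylla instance):
#   opener = build_opener_via_proxy()
#   for _ in range(3):
#       print(fetch_external_ip(opener))
```

If rotation is working, repeated calls should eventually report different IP addresses.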
As an alternative to the HTTP forward proxy, you can get a list of working proxies from the JSON API. The response is paginated, and each proxy entry includes fields such as:
"organization": "AS57099 Boundless Networks Limited",
"organization": "AS7922 Comcast Cable Communications, LLC",
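A response like this is easy to consume from Python. The sketch below rests on several assumptions that you should verify against your Scylla version: the endpoint path /api/v1/proxies, the page query parameter, and the proxies, ip and port field names are taken from the Scylla README as I recall it, and the helper names are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8899/api/v1/proxies"  # assumed endpoint path


def proxy_urls(payload):
    """Turn a decoded API response into "http://ip:port" strings."""
    return [
        f"http://{entry['ip']}:{entry['port']}"  # assumed field names
        for entry in payload.get("proxies", [])
    ]


def fetch_page(page=1, api_url=API_URL, timeout=10):
    """Download and decode one page of the proxy list."""
    url = f"{api_url}?page={page}"  # assumed pagination parameter
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires a running Scylla instance):
#   print(proxy_urls(fetch_page(1)))
```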
I recommend Scylla as the most feature-rich and community-driven free proxy scraper.
ProxyBroker - Public Proxies Scraper and Checker
The second promising tool is ProxyBroker. It is a popular proxy scraper with three nice-to-have features: proxy scraping, checking, and rotating through the built-in server.
The complete list of the features is the following:
- 50+ pre-packaged proxy sources
- Supported protocols: HTTP(S), SOCKS4/5. Also the CONNECT method to ports 80 and 25 (SMTP)
- Proxy filtering by type, anonymity level, response time, country and status in DNSBL
- Automatic proxy rotation with HTTP forwarding proxy
- Cookies and Referer checker
- Automatic removal of duplicate proxies
Unfortunately, the project is abandoned, but it's relatively easy to use it as a base for your own proxy scraper, as we did for our free proxy list.
ProxyBroker requires Python 3.5+, and the whole installation can be performed using pip:
pip install proxybroker
Usage is simple and works from the command line. For example, to get 10 high-anonymity proxies from the USA:
$ proxybroker find --types HTTP HTTPS --lvl High --countries US --strict -l 10
Or, to serve all high-anonymity proxies from the USA through a rotating proxy server (launched on the host and port given in the command):
proxybroker serve --host 127.0.0.1 --port 8888 --types HTTP HTTPS --lvl High --countries US
ProxyBroker can also be integrated into your Python code, so it is easy to extend it with your own functionality. Check out more information in the official documentation.
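As an illustration of such an integration, here is a minimal sketch based on the Broker class and its find coroutine as described in the ProxyBroker documentation. The helper names collect and find_proxies are hypothetical, and because the project is unmaintained, the library may not run on recent Python versions.

```python
import asyncio


async def collect(queue):
    """Drain the queue until the None sentinel that marks the end of results."""
    found = []
    while True:
        proxy = await queue.get()
        if proxy is None:
            break
        found.append(proxy)
    return found


async def find_proxies(limit=10):
    """Search for working HTTP(S) proxies from the USA using ProxyBroker."""
    from proxybroker import Broker  # third-party; imported only when needed

    queue = asyncio.Queue()
    broker = Broker(queue)
    _, found = await asyncio.gather(
        broker.find(types=["HTTP", "HTTPS"], countries=["US"], limit=limit),
        collect(queue),
    )
    return found


# Example usage (requires `pip install proxybroker` and network access):
#   for proxy in asyncio.run(find_proxies(limit=10)):
#       print(proxy)
```

The broker pushes each verified proxy into the queue while the consumer reads from it, so results can be processed as soon as they arrive instead of after the whole search finishes.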
proxy-scraper - starting point to build your own proxy scraper
proxy-scraper is a simple CLI tool built using Python. It does not have many features or integrations, but it is a great open-source project to start building your own solution from.
Despite its simplicity and lack of extra functionality, this project clearly demonstrates the main parts of a proxy scraper. It also lets you create your own proxy collector in a short time, even in a different programming language.
Why should free proxies be avoided?
Despite the apparent availability of free proxies, they have a number of drawbacks that prevent them from being used in production.
Because free proxies cost nothing to use, their security is often poor. Data transferred through such a proxy is susceptible to interception, malware, and other attacks that can lead to information theft.
Many people use the same free proxies for their web scraping projects as well as for web surfing. This saturation can slow connections down, which is not a big deal if you only need to transfer a small amount of traffic. If you transfer a lot of data, however, the speed can be frustrating.
Publicly available free proxy servers are operational only a small fraction of the time because of their high popularity. An IP address that worked recently may become unavailable or unstable, which in turn makes a web scraping program unreliable.
Free (publicly available) proxy servers are a good and educational starting point for experimenting with web scraping. Still, their quality is far from perfect, so spending some money on a web scraping API or a paid proxy service will lead to a drastic improvement.
As usual, check out our best picks:
- Free and Publicly Available Proxies - our list of scraped and checked proxy servers
- Residential VS Datacenter Proxies for Web Scraping - what is the difference between residential and datacenter proxy
- Three Reasons You Might Reconsider Getting a Free Proxy Server for Web Scraping - why using a free proxy server for web scraping is not the best idea
Happy Web Scraping, and don't forget to change your web scraper fingerprint 🕵️