
Best Free Proxy Scraping Tools

Oleg Kulyk · 7 min read

Best open source proxy scrapers

Using quality proxy servers is key to a successful web scraper. A pool of varied, reliable IPs makes it possible to collect data from various websites without worrying about being blocked.

Still, many websites provide free proxy lists, so can the process of collecting IP addresses from them be automated? And are free proxies good enough for web scraping? Let's find out.

Why do we need a proxy server for web scraping?

As you may know, web scraping is the process of extracting data from web pages. This means the data on those pages has value that website administrators want to protect from automated collection.

There are plenty of possible data protection mechanisms, like fingerprinting, CAPTCHAs, and JavaScript challenges. Still, the most basic ones are IP restriction and IP rate limiting, which automatically or manually block access from particular IPs. A proxy server allows a web scraper to avoid such protection by changing the requester's IP address and preventing further restriction.
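
To illustrate the idea, here is a minimal sketch using Python's requests library. The proxy address below is a placeholder, not a real server:

# Sketch: routing a request through an HTTP proxy so the target site
# sees the proxy's IP instead of ours. The address is a placeholder.
import requests

proxy = "http://203.0.113.10:8080"  # hypothetical proxy address
response = requests.get(
    "https://api.ipify.org",  # echoes the IP the request came from
    proxies={"http": proxy, "https": proxy},
    timeout=10,
)
print(response.text)  # prints the proxy's external IP, not ours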

Proxy scraping tools

Every good proxy scraping tool should perform at least two functions: constantly retrieving and checking proxy servers. The first function is usually implemented by crawling a pre-defined list of free proxy websites and collecting IP/port information from them. The second is a checker: a function that iterates through the whole database of harvested proxies and tries to perform a request through each of them (a minimal sketch follows).
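
Here is what such a checker might look like in Python. The test URL, the timeout value, and the sample addresses (taken from the JSON response shown later in this article) are illustrative assumptions:

# Minimal proxy checker sketch: tries each harvested proxy against a
# test URL and keeps only those that respond in time.
import requests

def check_proxies(proxies, test_url="https://api.ipify.org", timeout=5):
    working = []
    for address in proxies:  # each address is an "ip:port" string
        proxy_url = f"http://{address}"
        try:
            response = requests.get(
                test_url,
                proxies={"http": proxy_url, "https": proxy_url},
                timeout=timeout,
            )
            if response.ok:
                working.append(address)
        except requests.RequestException:
            pass  # dead, refused, or too slow: drop it
    return working

# Sample addresses for illustration only
print(check_proxies(["91.229.222.163:53281", "75.151.213.85:8080"]))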

More advanced proxy scrapers may also collect meta-information about proxies (country, average latency, anonymity level, etc.) and provide proxy rotation along with HTTP proxy forwarding, so you don't need to manage a connection for each of the collected IP addresses.

Let's check out the most popular and promising open-source tools from this category.

Scylla - An Intelligent Proxy Pool

Scylla is one of my favourites across all the projects I've seen. It combines a proxy crawler, a checker, and an HTTP forward proxy in a single package. Also, Scylla is under active development, so the project contributors constantly deliver new features and bug fixes.

Here is a list of the best features:

  • Automatic proxy IP crawling and validation
  • Easy-to-use JSON API
  • Simple HTTP Forward proxy server
  • Docker image support
  • Scrapy and requests integration with as little as one line of code
  • Headless browser crawling

The fourth point means that installing Scylla is as easy as running the following command:

docker run -d -p 8899:8899 -p 8081:8081 -v /var/www/scylla:/var/www/scylla --name scylla wildcat/scylla:latest

This simple command will run the whole Scylla infrastructure without any additional configuration.

To do the same using pip:

pip install scylla
scylla # Run the crawler and web server for JSON API

These commands start the Scylla crawler, the JSON API (on port 8899 by default), and the HTTP forward proxy (on port 8081 by default).

This means that a few minutes after startup (Scylla needs some time to fill the database with working proxies), you will be able to scrape the Web using free rotating proxies.

To check out the proxy rotation function, simply run the following command:

curl http://api.ipify.org -x http://127.0.0.1:8081

As an alternative to the HTTP forward proxy, you can get a list of working proxies using the JSON API:

http://localhost:8899/api/v1/proxies

This returns a paginated response:

{
  "proxies": [{
    "id": 599,
    "ip": "91.229.222.163",
    "port": 53281,
    "is_valid": true,
    "created_at": 1527590947,
    "updated_at": 1527593751,
    "latency": 23.0,
    "stability": 0.1,
    "is_anonymous": true,
    "is_https": true,
    "attempts": 1,
    "https_attempts": 0,
    "location": "54.0451,-0.8053",
    "organization": "AS57099 Boundless Networks Limited",
    "region": "England",
    "country": "GB",
    "city": "Malton"
  }, {
    "id": 75,
    "ip": "75.151.213.85",
    "port": 8080,
    "is_valid": true,
    "created_at": 1527590676,
    "updated_at": 1527593702,
    "latency": 268.0,
    "stability": 0.3,
    "is_anonymous": true,
    "is_https": true,
    "attempts": 1,
    "https_attempts": 0,
    "location": "32.3706,-90.1755",
    "organization": "AS7922 Comcast Cable Communications, LLC",
    "region": "Mississippi",
    "country": "US",
    "city": "Jackson"
  },
  ...
  ],
  "count": 1025,
  "per_page": 20,
  "page": 1,
  "total_page": 52
}
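
As a quick illustration, here is a short Python sketch (assuming Scylla is running locally with the default ports) that pulls the first page of proxies from the JSON API and routes a request through one of them:

# Sketch: fetch working proxies from Scylla's JSON API and use the
# first one for a request. Assumes the default API port (8899).
import requests

api_response = requests.get("http://localhost:8899/api/v1/proxies")
proxies = api_response.json()["proxies"]

first = proxies[0]
proxy_url = f"http://{first['ip']}:{first['port']}"
print("Using proxy:", proxy_url)

check = requests.get(
    "http://api.ipify.org",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=10,
)
print("External IP:", check.text)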

I recommend Scylla as the most feature-rich and community-driven free proxy scraper.

ProxyBroker - Public Proxies Scraper and Checker

The second promising tool is ProxyBroker. It is a popular proxy scraper that covers the three core functions: proxy scraping, checking, and rotation through a built-in server.

The complete list of features is the following:

  • 50+ pre-packaged proxy sources
  • Supported protocols: HTTP(S), SOCKS4/5, plus the CONNECT method to ports 80 and 25 (SMTP)
  • Proxy filtering by type, anonymity level, response time, country and status in DNSBL
  • Automatic proxy rotation with HTTP forwarding proxy
  • Cookies and Referer checker
  • Automatic duplicates avoidance

Unfortunately, the project is abandoned, but it's relatively easy to take it as a base for your own proxy scraper, as we did for our free proxy list.

ProxyBroker requires Python 3.5+, and the whole installation can be performed using pip:

pip install proxybroker

Usage is simple and driven from the command line. For example, to find 10 high-anonymity proxies from the USA:

proxybroker find --types HTTP HTTPS --lvl High --countries US --strict -l 10

Or to serve all high-anonymity proxies from the USA through a rotating proxy server (launched on port 8888):

proxybroker serve --host 127.0.0.1 --port 8888 --types HTTP HTTPS --lvl High --countries US
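
Once the server is running, requests can be routed through it the same way as with Scylla's forward proxy:

curl http://api.ipify.org -x http://127.0.0.1:8888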

ProxyBroker can also be integrated into your Python code, so it is easy to extend with your own functionality. Check out more information in the official documentation.
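
As a starting point, here is the basic search example adapted from the official documentation: it finds 10 HTTP(S) proxies and prints them as they are discovered.

import asyncio
from proxybroker import Broker

async def show(proxies):
    # Print proxies as the broker pushes them into the queue
    while True:
        proxy = await proxies.get()
        if proxy is None:
            break  # None signals that the search is finished
        print('Found proxy: %s' % proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
    broker.find(types=['HTTP', 'HTTPS'], limit=10),
    show(proxies),
)

loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)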

proxy-scraper - starting point to build your own proxy scraper

proxy-scraper is a simple CLI tool built with Python. It does not have many features or integrations, but it is a great open-source project to start building your own solution from.

proxy-scraper contains only two code files, proxyScraper.py and proxyChecker.py, each of which performs its own job; they interact through an output file (output.txt by default).

Despite its simplicity and lack of extra functionality, this project demonstrates the main aspects of a proxy scraper well. Furthermore, it allows you to create your own proxy collector in a short time (even in a different programming language); a minimal sketch of the idea follows.
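
For example, the collecting half can be as small as this. The source URL is a placeholder, and the regular expression assumes the page lists proxies as plain ip:port pairs; adapt both to your actual source:

# Minimal proxy scraper sketch: collect ip:port pairs from a page and
# write them to output.txt. The URL below is a placeholder; point it
# at a real free proxy list and adapt the parsing to its layout.
import re
import requests

page = requests.get("https://example.com/free-proxy-list", timeout=10)
candidates = set(re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}", page.text))

with open("output.txt", "w") as f:
    f.write("\n".join(sorted(candidates)))

print(f"Collected {len(candidates)} proxy candidates")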

Why should free proxies be avoided?

Despite the apparent availability of free proxies, they have a number of drawbacks that prevent them from being used in production.

Proxy Security

Since free proxy lists cost nothing, the protection of each proxy may not be of high quality. This means that data transferred through such a proxy is susceptible to hacking, malware, and cyber-attacks leading to theft of information.

Proxy Speed

Many people choose the same free proxies for their web scraping projects as well as for everyday web surfing. This saturation may lead to slow connections, which is not a big deal if the amount of traffic to transfer is small. But if you perform a lot of network operations, the speed can be frustrating.

Proxy Stability

Publicly available free proxy servers are operational only a small fraction of the time due to their high popularity. Thus, a recently working IP address may become unavailable or unstable at any moment, which can lead to unstable operation of a web scraping program.

Conclusion

Free (publicly available) proxy servers are a good and educational starting point for experimenting with web scraping. Still, their quality is far from perfect, so spending some money on a web scraping API or a paid proxy service will lead to a drastic improvement.


Happy Web Scraping, and don't forget to change your web scraper fingerprint 🕵️
