To create a fully-featured web scraper, you need to solve several problems:
- how to extract data (retrieve the required content from the website)
- how to parse data (pick out only the required information)
- how to present or store the parsed data
Let's consider a simple NodeJS web scraper that will get the title text from the site example.com:
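Here is a minimal sketch of such a scraper, assuming Node's built-in https module and a regular expression for extracting the title:

```js
const https = require('https');

// Download the page and extract the contents of the <title> tag
https.get('https://example.com', (response) => {
  let html = '';
  response.on('data', (chunk) => (html += chunk));
  response.on('end', () => {
    const match = html.match(/<title>(.*?)<\/title>/);
    console.log(match ? match[1] : 'Title not found');
  });
});
```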
An HTTP client is a tool that provides the ability to communicate with servers via the HTTP protocol. In simple words, it's a module or library capable of sending requests to and receiving responses from servers.
Often, an HTTP client alone is enough to cover the data extraction part of web scraping: it sends a request to a web server, and the response contains the requested HTML. More complex data extraction tools usually include HTTP clients under the hood.
Axios is a promise-based HTTP client for the browser and NodeJS. To install it, you can use npm or your favorite package manager like Yarn:
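```bash
npm install axios
```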
The library usage is relatively simple, as shown in the example below:
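A minimal sketch, again extracting the title from example.com with the same regular expression approach as the first example:

```js
const axios = require('axios');

// axios.get() returns a promise; response.data holds the HTML string
axios.get('https://example.com')
  .then((response) => {
    const match = response.data.match(/<title>(.*?)<\/title>/);
    console.log(match ? match[1] : 'Title not found');
  })
  .catch(console.error);
```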
The second HTTP client worth mentioning is SuperAgent. It has both promise and callback interfaces and reliable community support, though for some reason it is less popular than Axios.
Installing the library is just as simple as it was for Axios:
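```bash
npm install superagent
```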
The example below demonstrates how to use SuperAgent via supported interfaces (promises and callbacks):
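A minimal sketch of both styles, reusing the same title extraction (the superagent package exposes the response body as response.text):

```js
const superagent = require('superagent');

// Promise interface
superagent
  .get('https://example.com')
  .then((response) => {
    const match = response.text.match(/<title>(.*?)<\/title>/);
    console.log('Promise:', match ? match[1] : 'Title not found');
  })
  .catch(console.error);

// Callback interface
superagent.get('https://example.com').end((error, response) => {
  if (error) return console.error(error);
  const match = response.text.match(/<title>(.*?)<\/title>/);
  console.log('Callback:', match ? match[1] : 'Title not found');
});
```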
Almost every web scraping tutorial on the Internet suggests using request for making API calls or retrieving web pages from a server. Still, the package is currently unmaintained and deprecated, so I wouldn't suggest using it for new projects. However, it might help in a legacy codebase when you need to make a few changes without refactoring.
Check out the request GitHub repository to learn more details if you'd still like to use it.
Usually, the retrieved website content is the HTML code of the entire web page, while the target of the web scraping process is specific information from that page: a product title, a price, an image URL, etc.
In the example at the start of this article, we used a regular expression to extract the title from the example.com content. This method works well for parsing strictly structured data like telephone numbers, emails, etc., but it becomes unnecessarily complicated for more common cases.
The libraries below help you create a well-structured, maintainable, and readable codebase without regular expressions.
Cheerio implements a jQuery-like API, one of the most efficient and straightforward APIs for parsing and manipulating the DOM, so Cheerio will feel native to you if you are already familiar with jQuery.
It's also simple to rewrite the example from the start of the article to use Cheerio:
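A minimal sketch, keeping Axios for the HTTP part and replacing the regular expression with a jQuery-style selector:

```js
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://example.com')
  .then((response) => {
    // Load the HTML and query it with a familiar jQuery-style selector
    const $ = cheerio.load(response.data);
    console.log($('title').text());
  })
  .catch(console.error);
```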
For an extended usage sample, please check our article: Amazon Scraping. Relatively easy.
JSDOM is more than just a parser: it acts like a browser. That means it will automatically add the necessary tags if you omit them from the data you are trying to parse. It also lets you convert the extracted HTML into a DOM and interact with elements, manipulate the tree structure and nodes, etc.
As we've been using our first example as a boilerplate, let's replace Cheerio with JSDOM and check out the end result:
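A minimal sketch, keeping Axios for the HTTP part (the runScripts: "dangerously" option is discussed right below):

```js
const axios = require('axios');
const { JSDOM } = require('jsdom');

axios.get('https://example.com')
  .then((response) => {
    // runScripts: "dangerously" executes scripts inside the page,
    // so only use it with content you trust
    const dom = new JSDOM(response.data, { runScripts: 'dangerously' });
    console.log(dom.window.document.querySelector('title').textContent);
  })
  .catch(console.error);
```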
As you can observe, we've moved away from the jQuery-style helpers and started manipulating the DOM directly.
The API is rich and includes many helpful features (and an explanation of the runScripts: "dangerously" option used above 🙂), so I highly recommend checking out the documentation.
Selenium is a popular web automation tool with a bunch of wrappers for different programming languages. The main idea of this library is to provide a web driver capable of controlling the browser.
Selenium features are pretty broad: keyboard input emulation, form filling, CAPTCHA resolving, interacting with buttons, links, etc.
The example below shows how to use keyboard input with a Google Search:
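A minimal sketch, assuming the selenium-webdriver package and a ChromeDriver installation available on the PATH (the name of Google's search box input, q, is an assumption that may change):

```js
const { Builder, By, Key, until } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://www.google.com');
    // Emulate keyboard input: type a query and press Enter
    await driver.findElement(By.name('q')).sendKeys('web scraping', Key.RETURN);
    await driver.wait(until.titleContains('web scraping'), 10000);
    console.log(await driver.getTitle());
  } finally {
    await driver.quit();
  }
})();
```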
Puppeteer is a Node.js library that offers a simple and efficient API and enables you to control Google’s Chrome or Chromium browser. It's a powerful tool as it allows you to crawl the web as if a real user were surfing a website with a browser.
It can be installed by running the following command:
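```bash
npm install puppeteer
```

Note that installing Puppeteer also downloads a recent build of Chromium that is guaranteed to work with the library.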
The following code demonstrates the basic concepts of Puppeteer usage (taking a screenshot):
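A minimal sketch that opens a page and saves a screenshot to disk (the target URL and file name are just an illustration):

```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch the bundled headless Chromium
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```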
We have a great example of using Puppeteer for scraping Angular-based websites, and you can check it here: AngularJS site scraping. Easy deal?
Playwright is a library that can be called Puppeteer's successor, but with Microsoft behind its maintenance. It even has some of the same maintainers that Puppeteer previously had, so its API will be very familiar to developers who have already tried Puppeteer. Still, unlike Puppeteer, it supports Chromium, WebKit, and Firefox backends, so you'll be able to manage all three browser types with a single API.
The kick-off is pretty smooth and has the same installation steps as the previous libraries:
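```bash
npm install playwright
```

Depending on the Playwright version, the browser binaries are downloaded during installation or with a separate `npx playwright install` command.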
And the example below shows how to take a screenshot of the ScrapingAnt landing page using the three supported browsers:
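A minimal sketch that loops over the three engines (the output file names are just an illustration):

```js
const playwright = require('playwright');

(async () => {
  for (const browserType of ['chromium', 'firefox', 'webkit']) {
    const browser = await playwright[browserType].launch();
    const page = await browser.newPage();
    await page.goto('https://scrapingant.com/');
    // Save one screenshot per browser engine
    await page.screenshot({ path: `scrapingant-${browserType}.png` });
    await browser.close();
  }
})();
```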
Playwright's documentation is well structured and searchable across the API, so it should be easy to find answers quickly.
The ScrapingAnt API itself handles headless Chrome and a pool of thousands of proxies under the hood, which means you don't have to maintain your own Puppeteer or Playwright cluster and can make an API call instead.
In simple words, each time you make a call to the web scraping API, ScrapingAnt runs a headless Chrome and opens the target URL via one of the proxies. Such a scheme allows you to avoid blocking and rate-limiting, so your web scraper will always receive the extracted data.
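As a sketch of what such a call might look like with Axios (the endpoint and header name below are assumptions for illustration; please verify the exact request format in the ScrapingAnt documentation):

```js
const axios = require('axios');

// NOTE: the endpoint and header name are assumptions for illustration only;
// check the ScrapingAnt documentation for the exact request format
axios
  .get('https://api.scrapingant.com/v1/general', {
    params: { url: 'https://example.com' },
    headers: { 'x-api-key': '<YOUR_API_TOKEN>' },
  })
  .then((response) => console.log(response.data))
  .catch(console.error);
```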
To obtain your API token, please log in to the dashboard. It's free for personal usage.
Hopefully, the further reading below can help you find more detailed information:
- General Web Scraping techniques - how to get desired information from the web page.
- 6 Puppeteer Tricks to Avoid Detection and Make Web Scraping Easier - tips and tricks for Puppeteer.
- Scraping with millions of browsers or Puppeteer Cluster - how to scrape with Puppeteer at scale.
- ScrapingAnt documentation - tons of information about web scraping and usage of web scraping API.
Happy web scraping, and don't forget to check the websites' policies regarding scraping bots 😉