Open Source Javascript Web Scraping

· 4 min read
Oleg Kulyk

In this article, I’d like to list some of the most popular open-source JavaScript projects that are useful for web scraping. The list includes both general-purpose libraries and standalone niche scrapers that each target a particular site (Amazon, iTunes, Instagram, Google Play, etc.).

Awesome Open Source JavaScript Projects for Web Scraping

HTTP interaction

  • Axios: Promise-based HTTP client for the browser and Node.js.
    Features: XMLHttpRequests from the browser, HTTP requests from Node.js, Promise API, request and response interception, request and response data transformation, automatic transforms for JSON data.
  • Got: Human-friendly and powerful HTTP request library for Node.js.
    Features: HTTP/2 support, Promise API, Stream API, Pagination API, Cookies (out of the box), Progress events.
  • Superagent: Small progressive client-side HTTP request library, and Node.js module with the same API, supporting many high-level HTTP client features.
    Features: HTTP/2 support, Promise API, Stream API, Request cancelation, Follows redirects, Retries on failure, Progress events.
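All three clients expose a Promise API, so the same retry logic can wrap any of them. Below is a minimal sketch of that idea, not code taken from any of these libraries: the `fetchWithRetry` helper, its option names, and the backoff values are assumptions made for illustration. The request function is passed in as an argument, so the sketch runs without installing a specific client.

```javascript
// Retry a promise-returning request with exponential backoff.
// `request` is any function that returns a promise, e.g. (url) => axios.get(url).
// The helper name and the option names are hypothetical, chosen for this example.
async function fetchWithRetry(request, url, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await request(url);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // With the defaults, wait 500 ms, 1000 ms, 2000 ms before the next attempt.
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Possible usage with Axios (assumes `npm install axios`):
// const axios = require('axios');
// fetchWithRetry((url) => axios.get(url), 'https://example.com')
//   .then((response) => console.log(response.status));
```

Got and Superagent ship their own retry options, so a wrapper like this is mostly useful when you want identical behavior across different clients.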

DOM manipulation and HTML parsing

  • Cheerio: Fast, flexible & lean implementation of core jQuery designed specifically for the server.
    Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.
  • jsdom: jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.
  • htmlparser2: A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface. This module started as a fork of the htmlparser module. The main difference is that htmlparser2 is intended to be used only with Node.js (it runs on other platforms via browserify). htmlparser2 has been rewritten multiple times and, while it maintains an API that's compatible with htmlparser in most cases, the two projects no longer share any code.

JavaScript execution and rendering

  • Puppeteer: Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
  • Awesome resources for Puppeteer: https://github.com/transitive-bullshit/awesome-puppeteer
  • Selenium: Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides an infrastructure for the W3C WebDriver specification, a platform- and language-neutral coding interface compatible with all major web browsers.
  • Playwright: Playwright is a Node.js library to automate Chromium, Firefox, and WebKit with a single API. Playwright is built to enable cross-browser web automation that is evergreen, capable, reliable, and fast.
  • PhantomJS: PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. It offers a fast, native implementation of web standards (DOM, CSS, JavaScript, Canvas, and SVG) with no emulation. Note that active development of PhantomJS was suspended in 2018.
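To make the difference from plain HTML parsing concrete, here is a rough sketch of fetching a page's HTML after its scripts have run, using Puppeteer's `launch`, `newPage`, `goto`, `content`, and `close` calls. The `renderPage` helper is a hypothetical name, and passing the `puppeteer` module in as an argument is an assumption made so the example has no install-time dependency. It is one possible pattern, not the only way to use the library.

```javascript
// Load a page in headless Chrome and return the HTML after scripts have run.
// `renderPage` is a hypothetical helper; `puppeteer` is passed in by the caller.
async function renderPage(puppeteer, url) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    // 'networkidle2' treats navigation as finished once there have been
    // no more than 2 network connections for at least 500 ms.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Serialized DOM, including nodes generated by client-side JavaScript.
    return await page.content();
  } finally {
    // Always release the browser process, even if navigation fails.
    await browser.close();
  }
}

// Possible usage (assumes `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// renderPage(puppeteer, 'https://example.com')
//   .then((html) => console.log(html.length));
```

The returned HTML string can then be handed to a parser such as Cheerio, which does not execute JavaScript on its own.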

Resource scrapers

  • amazon-scraper: A tool for scraping product information from Amazon.
  • app-store-scraper: Node.js module to scrape application data from the iTunes/Mac App Store.
  • instagram-scraper: Since Instagram has removed the option to load public data through its API, this actor should help replace this functionality.
  • google-play-scraper: Node.js module to scrape application data from the Google Play store.
  • scrapedin: Scraper for LinkedIn full profile data. Unlike other scrapers, it's working in 2020 with their new website.
  • tiktok-scraper: Scrape and download useful information from TikTok.

And these are only the most interesting ones. Feel free to browse GitHub to find the one that suits you best!

Conclusion

JavaScript is not as popular for web scraping as Python, but the community is growing, and this list will likely get bigger over time.

Also, our ScrapingAnt web scraping API is language agnostic, so you can try it even if you're not very familiar with JS or Python.
