Awesome Open Source JavaScript Projects for Web Scraping

In this article, I’d like to list some of the most popular open-source JavaScript projects that are useful for web scraping. The list covers both general-purpose libraries and standalone niche scrapers that target a particular site (Amazon, iTunes, Instagram, Google Play, etc.).

HTTP interaction

  • Axios: Promise-based HTTP client for the browser and Node.js.
    Features: XMLHttpRequests from the browser, HTTP requests from Node.js, Promise API, request and response interception, request and response transformation, automatic transforms for JSON data.
  • Got: Human-friendly and powerful HTTP request library for Node.js. 
    Features: HTTP/2 support, Promise API, Stream API, Pagination API, Cookies (out of the box), Progress events.
  • Superagent: Small progressive client-side HTTP request library, and Node.js module with the same API, supporting many high-level HTTP client features. 
    Features: HTTP/2 support, Promise API, Stream API, Request cancelation, Follows redirects, Retries on failure, Progress events.
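
To make the Promise API and retry features above more concrete, here is a minimal sketch using Axios. The function names fetchHtml and fetchWithRetry, the User-Agent string, and the timeout and delay values are assumptions made for illustration; only client.get(url, config) and the response's data field come from Axios itself. The client is passed in as a parameter (defaulting to Axios) so the logic can be tried without network access.

```javascript
// Fetch a page's HTML with a Promise-based HTTP client.
// `client` defaults to Axios (npm install axios) and is only loaded when
// no client is passed in, so a stub can stand in for it offline.
async function fetchHtml(url, client = require("axios")) {
  const response = await client.get(url, {
    timeout: 10000, // give up after 10 seconds (illustrative value)
    headers: { "User-Agent": "example-scraper/0.1" }, // placeholder value
    responseType: "text", // ask for the body as text
  });
  return response.data; // Axios exposes the response body as `data`
}

// Hypothetical helper: retry a failing request with a fixed delay
// between attempts, then rethrow the last error if all attempts fail.
async function fetchWithRetry(url, { retries = 2, delayMs = 500, client } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchHtml(url, client);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With Axios installed, calling fetchWithRetry("https://example.com") resolves to the page body as a string. Got and Superagent ship their own retry options, so a hand-written loop like this is mostly useful when you want the same behavior regardless of the client.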

DOM manipulation and HTML parsing

  • Cheerio: Fast, flexible & lean implementation of core jQuery designed specifically for the server.
    Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.
  • jsdom: jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.
  • htmlparser2: A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.
    This module started as a fork of the htmlparser module. The main difference is that htmlparser2 is intended to be used only with Node.js (it runs on other platforms using browserify). htmlparser2 was rewritten multiple times and, while it maintains an API that’s compatible with htmlparser in most cases, the projects don’t share any code anymore.
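
As an illustration of the callback interface mentioned for htmlparser2, the sketch below collects the href of every link on a page. The names createLinkCollector and extractLinks are hypothetical; the Parser class, its onopentag callback, and the write() and end() methods are the parts that come from htmlparser2. The parser class is passed in as a parameter (defaulting to htmlparser2's) so the handler can be exercised on its own.

```javascript
// Build a handler object for htmlparser2's callback interface that
// collects the href of every <a> tag. The handler is plain JavaScript,
// so it can be tested without the parser.
function createLinkCollector() {
  const links = [];
  const handler = {
    // Called by the parser for each opening tag with its attributes.
    onopentag(name, attribs) {
      if (name === "a" && attribs.href) {
        links.push(attribs.href);
      }
    },
  };
  return { handler, links };
}

// Parse an HTML string and return the collected links.
// `Parser` defaults to htmlparser2's Parser (npm install htmlparser2).
function extractLinks(html, Parser = require("htmlparser2").Parser) {
  const { handler, links } = createLinkCollector();
  const parser = new Parser(handler);
  parser.write(html); // feed the markup; can be called once per chunk
  parser.end(); // signal that no more input is coming
  return links;
}
```

Because write() can be called repeatedly, the same handler works when the HTML arrives in chunks from a stream. For jQuery-style selection instead of callbacks, Cheerio's cheerio.load(html) returns a function you can query with CSS selectors such as $("a").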

JavaScript execution and rendering

  • Puppeteer: Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
  • Awesome resources for Puppeteer: https://github.com/transitive-bullshit/awesome-puppeteer
  • Selenium: Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides infrastructure for the W3C WebDriver specification — a platform and language-neutral coding interface compatible with all major web browsers.
  • Playwright: Playwright is a Node library to automate Chromium, Firefox and WebKit with a single API. Playwright is built to enable cross-browser web automation that is evergreen, capable, reliable and fast.
  • PhantomJS: PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. It offers a fast, native implementation of web standards (DOM, CSS, JavaScript, Canvas, and SVG) with no emulation. Note that its development was suspended in 2018.
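
For pages that only produce their content after running client-side scripts, a headless browser can return the rendered HTML. Below is a minimal sketch with Puppeteer. The function name renderPage and the choice of the "networkidle2" wait condition are assumptions for illustration; launch(), newPage(), goto(), content(), and close() are Puppeteer's own methods. The library is passed in as a parameter (defaulting to Puppeteer) so the control flow can be checked without launching a browser.

```javascript
// Load a page in headless Chrome, let its JavaScript run, and return
// the rendered HTML. `puppeteer` defaults to the Puppeteer package
// (npm install puppeteer); it is injectable so a stub can replace it.
async function renderPage(url, puppeteer = require("puppeteer")) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering
    // has a chance to finish (assumed to be enough for this sketch).
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content(); // serialized HTML of the current DOM
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

With Puppeteer installed, renderPage("https://example.com") resolves to the page's HTML after scripts have run, which can then be handed to a parser such as Cheerio. The try/finally block matters in long-running scrapers: without it, a failed navigation would leave a browser process behind.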

Resource scrapers

  • amazon-scraper: A useful tool to scrape product information from Amazon.
  • app-store-scraper: Node.js module to scrape application data from the iTunes/Mac App Store.
  • instagram-scraper: Since Instagram has removed the option to load public data through its API, this actor should help replace this functionality.
  • google-play-scraper: Node.js module to scrape application data from the Google Play store.
  • scrapedin: Scraper for LinkedIn full profile data. Unlike other scrapers, it’s working in 2020 with their new website.
  • tiktok-scraper: Scrape and download useful information from TikTok.

Conclusion

JavaScript is not as popular a language for web scraping as Python, but the community is growing, and this list will likely get longer over time.

Also, our scraping API is language-agnostic, so you can try it even if you are not very familiar with JS or Python.
