To create a fully-featured web scraper, you need to solve several problems:
- how to extract data (retrieve the required content from the website)
- how to parse data (pick out only the required information)
- how to present or store the parsed data
Let's consider a simple NodeJS web scraper that will get the title text from the site example.com:
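Here is a minimal sketch of such a scraper, assuming Node's built-in https module and a regular expression for extracting the title:

```js
const https = require('https');

// Download the page and extract the contents of the <title> tag
https.get('https://example.com', (response) => {
  let html = '';
  response.on('data', (chunk) => (html += chunk));
  response.on('end', () => {
    const match = html.match(/<title>(.*?)<\/title>/);
    console.log(match ? match[1] : 'Title not found');
  });
});
```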
An HTTP client is a tool that provides the ability to communicate with servers via the HTTP protocol. In simple words, it's a module or library capable of sending requests to and receiving responses from servers.
Often, an HTTP client alone is enough to cover the data extraction part of web scraping: it sends a request to a web server, and the response contains the requested HTML. More complex data extraction tools usually include HTTP clients under the hood.
Axios is a promise-based HTTP client for the browser and NodeJS. To install it, you can use npm or your favorite package manager like Yarn:
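```bash
npm install axios
```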
The library usage is relatively simple, as shown in the example below:
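A minimal sketch, again extracting the title from example.com with the same regular expression approach as the first example:

```js
const axios = require('axios');

// axios.get() returns a promise; response.data holds the HTML string
axios.get('https://example.com')
  .then((response) => {
    const match = response.data.match(/<title>(.*?)<\/title>/);
    console.log(match ? match[1] : 'Title not found');
  })
  .catch(console.error);
```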
The second HTTP client worth mentioning is SuperAgent. It has both promise and callback interfaces and reliable community support, though for some reason it is less popular than Axios.
Installing the library is just as simple as it was for Axios:
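```bash
npm install superagent
```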
The example below demonstrates how to use SuperAgent via supported interfaces (promises and callbacks):
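A minimal sketch of both styles, reusing the same title extraction (the superagent package exposes the response body as response.text):

```js
const superagent = require('superagent');

// Promise interface
superagent
  .get('https://example.com')
  .then((response) => {
    const match = response.text.match(/<title>(.*?)<\/title>/);
    console.log('Promise:', match ? match[1] : 'Title not found');
  })
  .catch(console.error);

// Callback interface
superagent.get('https://example.com').end((error, response) => {
  if (error) return console.error(error);
  const match = response.text.match(/<title>(.*?)<\/title>/);
  console.log('Callback:', match ? match[1] : 'Title not found');
});
```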
Almost every web scraping tutorial on the Internet suggests using request for making API calls or retrieving web pages from a server. Still, the package is currently unmaintained and deprecated, so I wouldn't suggest using it for new projects. However, it might help in a legacy codebase when you need to make a few changes without refactoring.
Check out the request GitHub repository to learn more details if you'd still like to use it.
Usually, the retrieved website content is the HTML code of the entire web page, while the target of the web scraping process is specific information from that page: a product title, a price, an image URL, etc.
In the example at the start of this article, we used a regular expression to extract the title from the example.com content. This method works well for parsing strictly structured data like telephone numbers, emails, etc., but it becomes unnecessarily complicated for more common cases.
The libraries below help you create a well-structured, maintainable, and readable codebase without regular expressions.
Cheerio implements a jQuery-like API, one of the most efficient and straightforward APIs for parsing and manipulating the DOM, so Cheerio will feel native to you if you are already familiar with jQuery.
It's also simple to rewrite the example from the start of the article to use Cheerio:
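A minimal sketch, keeping Axios for the HTTP part and replacing the regular expression with a jQuery-style selector:

```js
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://example.com')
  .then((response) => {
    // Load the HTML and query it with a familiar jQuery-style selector
    const $ = cheerio.load(response.data);
    console.log($('title').text());
  })
  .catch(console.error);
```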
For an extended usage sample, please check our article: Amazon Scraping. Relatively easy.
JSDOM is more than just a parser: it acts like a browser. That means it will automatically add the necessary tags if you omit them from the data you are trying to parse. It also lets you convert the extracted HTML into a DOM and interact with elements, manipulate the tree structure and nodes, etc.
As we've been using our first example as a boilerplate, let's replace Cheerio with JSDOM and check out the end result:
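A minimal sketch, keeping Axios for the HTTP part (the runScripts: "dangerously" option is discussed right below):

```js
const axios = require('axios');
const { JSDOM } = require('jsdom');

axios.get('https://example.com')
  .then((response) => {
    // runScripts: "dangerously" executes scripts inside the page,
    // so only use it with content you trust
    const dom = new JSDOM(response.data, { runScripts: 'dangerously' });
    console.log(dom.window.document.querySelector('title').textContent);
  })
  .catch(console.error);
```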
As you can observe, we've moved away from the jQuery-style helpers and started manipulating the DOM directly.
The API is rich and includes many helpful features (and an explanation of the runScripts: "dangerously" option used above 🙂), so I highly recommend checking out the documentation.
Selenium is a popular web automation tool with a bunch of wrappers for different programming languages. The main idea of this library is to provide a web driver capable of controlling the browser.
Selenium features are pretty broad: keyboard input emulation, form filling, CAPTCHA resolving, interacting with buttons, links, etc.
The example below shows how to use keyboard input with a Google Search:
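A minimal sketch, assuming the selenium-webdriver package and a ChromeDriver installation available on the PATH (the name of Google's search box input, q, is an assumption that may change):

```js
const { Builder, By, Key, until } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://www.google.com');
    // Emulate keyboard input: type a query and press Enter
    await driver.findElement(By.name('q')).sendKeys('web scraping', Key.RETURN);
    await driver.wait(until.titleContains('web scraping'), 10000);
    console.log(await driver.getTitle());
  } finally {
    await driver.quit();
  }
})();
```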
Puppeteer is a Node.js library that offers a simple and efficient API and enables you to control Google’s Chrome or Chromium browser. It's a powerful tool as it allows you to crawl the web as if a real user were surfing a website with a browser.
It can be installed by running the following command:
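```bash
npm install puppeteer
```

Note that installing Puppeteer also downloads a recent build of Chromium that is guaranteed to work with the library.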
The following code demonstrates the basic concepts of Puppeteer usage (taking a screenshot):
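A minimal sketch that opens a page and saves a screenshot to disk (the target URL and file name are just an illustration):

```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch the bundled headless Chromium
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```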
We have a great example of using Puppeteer for scraping Angular-based websites, and you can check it here: AngularJS site scraping. Easy deal?
Playwright is a library that can be called Puppeteer's successor, but with Microsoft behind its maintenance. It even has some of the same maintainers that Puppeteer previously had, so its API will be very familiar to developers who have already tried Puppeteer. Still, unlike Puppeteer, it supports Chromium, WebKit, and Firefox backends, so you'll be able to manage all three browser types with a single API.
The kick-off is pretty smooth and has the same installation steps as the previous libraries:
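```bash
npm install playwright
```

Depending on the Playwright version, the browser binaries are downloaded during installation or with a separate `npx playwright install` command.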
And the example below shows how to take a screenshot of the ScrapingAnt landing page using the three supported browsers:
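A minimal sketch that loops over the three engines (the output file names are just an illustration):

```js
const playwright = require('playwright');

(async () => {
  for (const browserType of ['chromium', 'firefox', 'webkit']) {
    const browser = await playwright[browserType].launch();
    const page = await browser.newPage();
    await page.goto('https://scrapingant.com/');
    // Save one screenshot per browser engine
    await page.screenshot({ path: `scrapingant-${browserType}.png` });
    await browser.close();
  }
})();
```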
Playwright's documentation is well structured and searchable across the API, so it should be easy to find answers quickly.
The ScrapingAnt API itself handles headless Chrome and a pool of thousands of proxies under the hood, which means you don't have to maintain your own Puppeteer or Playwright cluster and can make an API call instead.
In simple words, each time you make a call to the web scraping API, ScrapingAnt runs a headless Chrome and opens the target URL via one of the proxies. Such a scheme allows you to avoid blocking and rate-limiting, so your web scraper will always receive the extracted data.
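As a sketch of what such a call might look like with Axios (the endpoint and header name below are assumptions for illustration; please verify the exact request format in the ScrapingAnt documentation):

```js
const axios = require('axios');

// NOTE: the endpoint and header name are assumptions for illustration only;
// check the ScrapingAnt documentation for the exact request format
axios
  .get('https://api.scrapingant.com/v1/general', {
    params: { url: 'https://example.com' },
    headers: { 'x-api-key': '<YOUR_API_TOKEN>' },
  })
  .then((response) => console.log(response.data))
  .catch(console.error);
```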
To obtain your API token, please log in to the dashboard. It's free for personal usage.
Hopefully, the further reading below can help you find more detailed information:
- General Web Scraping techniques - how to get desired information from the web page.
- 6 Puppeteer Tricks to Avoid Detection and Make Web Scraping Easier - tips and tricks for Puppeteer.
- Scraping with millions of browsers or Puppeteer Cluster - how to scrape with Puppeteer at scale.
- ScrapingAnt documentation - tons of information about web scraping and usage of web scraping API.
Happy web scraping, and don't forget to check the websites' policies regarding scraping bots 😉