HTML Parsing Libraries - JavaScript

· 5 min read
Oleg Kulyk

HTML is a simple structured markup language, and anyone who writes a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing. We are not going to present libraries for more specific tasks, such as article extractors, product extractors, or web scrapers.

JavaScript HTML parsers

1. DOMParser

The native DOM manipulation capabilities of JavaScript and jQuery are great for simple parsing of HTML fragments. However, if you actually need to parse a complete HTML or XML source into a DOM document programmatically, there is a better solution: DOMParser. It is available in all modern browsers.

DOMParser makes it easy to parse an HTML document. Without it, you usually have to trick the browser into parsing the markup for you, for instance by adding a new element to the current document.

The usage of DOMParser is quite simple and straightforward:

// A single DOMParser instance can be reused for several documents.
const domParser = new DOMParser();

let doc = domParser.parseFromString(stringContainingXMLSource, "application/xml");
// returns an XMLDocument, which is a Document but not an HTML document

doc = domParser.parseFromString(stringContainingSVGSource, "image/svg+xml");
// also returns an XMLDocument; its root element is the <svg> element

doc = domParser.parseFromString(stringContainingHTMLSource, "text/html");
// returns an HTML Document (HTMLDocument)

2. Cheerio

Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

There is little more to say about Cheerio than it is jQuery on the server. It should be obvious, but we are going to state it anyway: it looks like jQuery, but there is no browser. This means that Cheerio parses HTML and makes it easy to manipulate, but it does not make things happen. It does not interpret the HTML as if it were in the browser; both in the sense that it might parse things differently from a browser and that the results of the parsing are not sent directly to the user. If you need that you will have to take care of it yourself.

The author created this library as a lightweight alternative to jsdom that is also faster and less strict in parsing. That leniency is needed to parse real and messy websites.

The syntax and usage of Cheerio should be very familiar to any JavaScript developer.

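const cheerio = require('cheerio');
const $ = cheerio.load('<h3 class="title">Hello there!</h3>');

$('h3.title').text('There is nobody here!');
$('h3').attr('id', 'new_id');

// Serialize only the <h3>; $.html() without arguments returns the whole document
console.log($.html('h3'));
//=> <h3 class="title" id="new_id">There is nobody here!</h3>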

3. jsdom

jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.

So jsdom is more than an HTML parser: it works like a browser. In the context of parsing, this means that it automatically adds the necessary tags if you omit them from the data you are trying to parse. For instance, if there is no html tag, it adds one implicitly, just like a browser would.

You can also optionally specify a few properties, like the URL of the document, referrer, or user agent. The URL is particularly useful if you need to parse links that contain local URLs.
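To illustrate why the document URL matters, here is a minimal sketch of how relative links resolve against a base URL. It does not use jsdom itself: it relies on the standard WHATWG URL constructor built into Node.js and browsers, which follows the same resolution rules a browser applies. The helper name resolveLinks and the example.org addresses are assumptions made up for this illustration.

```javascript
// Hypothetical helper (not part of jsdom): resolve every href found on a page
// against the URL the page was fetched from, using the built-in URL class.
function resolveLinks(hrefs, documentUrl) {
  return hrefs.map((href) => new URL(href, documentUrl).href);
}

// Made-up example values.
const documentUrl = 'https://example.org/blog/post.html';
const hrefs = ['/pricing', 'images/logo.png', '../about', 'https://other.example/x'];

const resolved = resolveLinks(hrefs, documentUrl);
console.log(resolved);
```

Without a document URL there is nothing to resolve relative links such as /pricing against, which is why supplying one is useful when the parsed page contains local URLs.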

Since it is not really related to parsing, we will only mention that jsdom has a (virtual) console, support for cookies, etc. In short, it has all you need to simulate a browser environment. It can also deal with external resources, even JavaScript scripts, which means that it can load and execute them if you ask. Note, however, that there are security risks in doing so, just like when you execute any external code. All of this comes with a number of caveats that you should read about in the documentation.

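const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM('<!DOCTYPE html><p>Hello, world</p>');

// The parsed document is exposed through the emulated window object
console.log(dom.window.document.querySelector("p").textContent);
// => "Hello, world"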

4. Parse5

parse5 provides nearly everything you may need when dealing with HTML.

Parse5 is a library meant to be used to build other tools but can also be used to parse HTML directly for simple tasks. It is easy to use, but the issue is that it does not provide the methods that the browser gives you to manipulate the DOM (e.g., getElementById).

The difficulty is also increased by the limited documentation: it is basically a series of questions that are answered with an API reference (e.g., “I need to parse an HTML string” => Use parse5.parse method). So, it is feasible to use it for simple DOM manipulation, but you are probably not going to want to.

On the other hand, parse5 lists an impressive series of projects that adopt it: jsdom, Angular2, and Polymer. So, if you need a reliable foundation for advanced manipulation or parsing of HTML, it is clearly a good choice.

const parse5 = require('parse5');

const document = parse5.parse('<!DOCTYPE html><html><head></head><body>Hello there!</body></html>');

console.log(document.childNodes[1].tagName); //=> 'html'
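Because parse5 does not give you getElementById, you have to walk the tree yourself. The following is a minimal sketch of such a lookup, assuming the node shape produced by parse5's default tree adapter (childNodes arrays and attrs lists of name/value pairs). The helper findById is hypothetical, and the sample tree is written by hand so the snippet runs without dependencies; in real use you would pass in the result of parse5.parse.

```javascript
// Hypothetical helper: depth-first search for the first node whose "id"
// attribute matches, in a tree shaped like parse5's default output.
function findById(node, id) {
  const attrs = node.attrs || []; // text and document nodes have no attrs
  if (attrs.some((attr) => attr.name === 'id' && attr.value === id)) {
    return node;
  }
  for (const child of node.childNodes || []) {
    const match = findById(child, id);
    if (match) return match;
  }
  return null;
}

// Hand-written stand-in for the tree parse5.parse() would build for
// '<html><head></head><body><p id="greeting">Hello there!</p></body></html>'.
const sampleTree = {
  nodeName: '#document',
  childNodes: [{
    nodeName: 'html', tagName: 'html', attrs: [],
    childNodes: [
      { nodeName: 'head', tagName: 'head', attrs: [], childNodes: [] },
      {
        nodeName: 'body', tagName: 'body', attrs: [],
        childNodes: [{
          nodeName: 'p', tagName: 'p',
          attrs: [{ name: 'id', value: 'greeting' }],
          childNodes: [{ nodeName: '#text', value: 'Hello there!' }],
        }],
      },
    ],
  }],
};

const found = findById(sampleTree, 'greeting');
console.log(found.tagName);
```

Having to write even this small helper yourself is the kind of overhead that makes libraries built on top of parse5, such as jsdom, more convenient for everyday DOM manipulation.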

HTML parsing: summary

We have seen a few libraries, and you might be surprised that, despite the popularity of HTML, there are usually few mature choices. That is because while HTML is very popular and structurally simple, providing support for all the multiple standards is hard work.

On top of that, real-world HTML documents are often malformed according to the standard, yet they still work in the browser, so they must work with your library, too. Add to that the need to provide an easy way to traverse an HTML document, and the shortage is readily explained.

While there might not always be many choices, luckily there is always at least one good choice available for you and your parser.

All of these libraries can be easily integrated with our web scraping API. Don't hesitate to check it out!
