HTML parsing is a vital part of web scraping, as it allows you to convert web page content into meaningful, structured data. Still, because HTML is a tree-structured format, it requires a proper parsing tool: it can't be properly traversed using Regex.
This article covers the most popular .NET libraries for HTML parsing, along with their strengths and weaknesses.
Let's have a quick review of the libraries, their licenses, and their nuances.
HtmlAgilityPack is one of the most (if not the most) famous HTML parsing libraries in the .NET world. As a result, many articles have been written about it.
In short, it is a fast, relatively handy library for working with HTML (assuming XPath queries are simple).
This parsing library will be convenient if the task is typical and well described by an XPath expression. For example, to get all the links from a page, we need very little code:
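A minimal sketch of such a link extractor is shown below. It assumes the HtmlAgilityPack NuGet package is installed; the class and method names (`HapLinks`, `GetLinks`) are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapLinks
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

Calling `HapLinks.GetLinks(html)` on a downloaded page returns the raw attribute values, so relative URLs still need to be resolved by the caller.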
Still, working with CSS classes is inconvenient in this library and requires more complex expressions:
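To illustrate, here is a rough sketch of selecting elements by CSS class with plain XPath 1.0. The `HapByClass` helper is hypothetical, and it assumes the class name contains no quote characters, since it is interpolated into the expression.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapByClass
{
    // Returns the trimmed text of every element carrying the given CSS class.
    public static List<string> GetTextByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class attribute with spaces so that "item" matches
        // class="item first" but not class="item-large".
        var xpath =
            $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {cssClass} ')]";

        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

The equivalent CSS selector would simply be `.item`, which is why the XPath form feels heavy for this kind of query.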
Among the oddities I noticed is its peculiar API, which is sometimes incomprehensible and confusing. However, the fact that the library is no longer abandoned is encouraging and makes it a real alternative to AngleSharp.
AngleSharp is written from scratch using C#.
The library code is clean, neat, and user-friendly.
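AngleSharp exposes the standard DOM API, including CSS selectors via `QuerySelectorAll`. As an illustration, collecting the links from a page might look like the sketch below. It assumes AngleSharp 0.10 or later (where `HtmlParser.ParseDocument` is available); the `AngleSharpLinks` class is a made-up name.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class AngleSharpLinks
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetLinks(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // The CSS selector "a[href]" skips anchors without an href attribute.
        return document.QuerySelectorAll("a[href]")
            .Select(a => a.GetAttribute("href") ?? string.Empty)
            .ToList();
    }
}
```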
CsQuery is a jQuery port for .NET. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods.
It was one of the modern HTML parsers for .NET. The library was based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko (Firefox) engine.
Unfortunately, the project has been abandoned by its author. The recommended alternative is AngleSharp.
The code for getting links from a page looks nice and familiar to anyone who has used jQuery:
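A minimal sketch, assuming the CsQuery NuGet package is installed (the `CsQueryLinks` class is a hypothetical name):

```csharp
using System.Collections.Generic;
using CsQuery;

public static class CsQueryLinks
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetLinks(string html)
    {
        // CQ plays the role of the jQuery object; the indexer takes a selector.
        CQ dom = CQ.Create(html);

        var links = new List<string>();
        foreach (IDomObject anchor in dom["a[href]"])
        {
            links.Add(anchor.GetAttribute("href"));
        }

        return links;
    }
}
```

The `dom["a[href]"]` indexer mirrors jQuery's `$("a[href]")`, which is what makes the code feel familiar.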
Fizzler is an add-on to HtmlAgilityPack (its implementation is built on top of HtmlAgilityPack) that allows you to use CSS selectors.
Let's see what problem Fizzler solves, using the sample from the documentation:
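The sketch below is written in the spirit of that sample rather than copied from it. It assumes the Fizzler.Systems.HtmlAgilityPack NuGet package, which adds a `QuerySelectorAll` extension method to `HtmlNode`; the `FizzlerDemo` wrapper is a made-up name.

```csharp
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

public static class FizzlerDemo
{
    // Returns the trimmed text of every element matching the CSS selector.
    public static List<string> SelectText(string html, string cssSelector)
    {
        // The document is loaded with HtmlAgilityPack as usual.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fizzler contributes QuerySelectorAll as an extension on HtmlNode.
        return doc.DocumentNode
            .QuerySelectorAll(cssSelector)
            .Select(node => node.InnerText.Trim())
            .ToList();
    }
}
```

With this helper, a selector such as `.content` replaces the longer XPath expression needed to match a CSS class.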
It is almost the same speed as HtmlAgilityPack, but more convenient because of the CSS selectors.
Regex is an ancient tool and not a good approach for parsing HTML. However, it lets you perform the task much faster than libraries that build a DOM tree.
When it comes to regular expressions, you should understand that you cannot build a universal and absolutely reliable solution with them. However, if you only need to parse one specific site, this problem may not be so critical.
The code for getting links from the page still looks clear:
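A minimal sketch using only the standard library is shown below. The pattern is deliberately simple and is an assumption of this example: it handles only double-quoted `href` values and will also be fooled by attributes such as `data-href`.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexLinks
{
    // Matches an opening <a ...> tag and captures a double-quoted href value.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href=\"([^\"]*)\"",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the captured href value of every matching <a> tag.
    public static List<string> GetLinks(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

Single-quoted or unquoted attribute values are not matched, which illustrates why a Regex solution tends to be tied to one specific site's markup.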
If you suddenly want to parse tables with Regex, and even in a fancy format, please look here first.
Parser speed is, after all, one of the most important attributes. HTML parsing speed determines how long it will take you to finish a given task.
To measure parser performance, I used the BenchmarkDotNet library from DreamWalker.
The measurements were made on an Intel® Core(TM) i9-9880H CPU @ 2.30GHz, but experience tells us that the relative time will be the same on any other configuration.
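For readers unfamiliar with BenchmarkDotNet, a benchmark class might be sketched roughly as follows. This is not the benchmark code used for the measurements in this article: the class name, the `page.html` file (a locally saved copy of the page under test), and the Regex pattern are all placeholders.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class LinkExtractionBenchmark
{
    private string _html = string.Empty;

    // Runs once before the measurements, so file I/O is not part of the timing.
    [GlobalSetup]
    public void Setup()
    {
        // "page.html" is a hypothetical path to a saved copy of the page.
        _html = File.ReadAllText("page.html");
    }

    // Each [Benchmark] method is measured separately; returning a value
    // keeps the work from being optimized away.
    [Benchmark]
    public int RegexLinkCount()
    {
        return Regex.Matches(
            _html,
            "<a\\s[^>]*?href=\"([^\"]*)\"",
            RegexOptions.IgnoreCase).Count;
    }
}
```

The benchmark is started from a console application built in Release mode with `BenchmarkDotNet.Running.BenchmarkRunner.Run<LinkExtractionBenchmark>()`; one `[Benchmark]` method per library gives a side-by-side comparison.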
Regex is an excellent tool, but parsing HTML is not a task it is suited for. As an experiment, however, I did try to implement a minimal working version of the code. While it works perfectly, the amount of time I spent programming it suggests that I definitely wouldn't do it again.
Well, let's take a look at the benchmarks.
Extracting links from a page seems to me a basic task for any parser; more often than not, this is how an introduction to the world of parsers (and sometimes Regex as well) begins.
As a scraping example, I've used the main page of ScrapingAnt:
The benchmark code can be found on GitHub, and there is a table with the results below:
| HtmlAgilityPack | 3.653 ms | 0.087 ms | 3.579 ms |
| AngleSharp | 5.864 ms | 0.091 ms | 5.853 ms |
| CsQuery | 14.269 ms | 0.284 ms | 13.931 ms |
| Fizzler | 4.147 ms | 0.081 ms | 4.105 ms |
| Regex | 0.547 ms | 0.010 ms | 0.543 ms |
Generally, Regex was expectedly the fastest but far from the most comfortable. HtmlAgilityPack and Fizzler showed approximately the same processing time, slightly ahead of AngleSharp. CsQuery, unfortunately, was hopelessly behind.
Extracting data from an HTML table is a common task among visitors to our site, since we provide a constantly updated, publicly available list of free proxies for web scraping.
The code is about the same for all the libraries; the only difference is the API.
However, there are two things worth mentioning: first, AngleSharp has specialized interfaces, which made the task easier. Second, Regex is not suitable for this task at all.
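To show what such specialized interfaces look like, here is a rough sketch that reads a table through AngleSharp's `IHtmlTableElement`, whose `Rows` and `Cells` collections remove the need for nested selectors. It assumes AngleSharp 0.10 or later; the `AngleSharpTables` class is a made-up name and not the benchmark code itself.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class AngleSharpTables
{
    // Returns the first table in the document as rows of trimmed cell text.
    public static List<List<string>> ReadFirstTable(string html)
    {
        var document = new HtmlParser().ParseDocument(html);

        // <table> elements are exposed through the typed IHtmlTableElement.
        var table = document.QuerySelectorAll("table")
            .OfType<IHtmlTableElement>()
            .FirstOrDefault();
        if (table == null)
        {
            return new List<List<string>>();
        }

        // Rows covers header and body rows; Cells covers <th> and <td>.
        return table.Rows
            .Select(row => row.Cells
                .Select(cell => cell.TextContent.Trim())
                .ToList())
            .ToList();
    }
}
```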
| HtmlAgilityPack | 3.323 ms | 0.0947 ms | 3.317 ms |
| AngleSharp | 3.920 ms | 0.0557 ms | 3.929 ms |
| CsQuery | 8.475 ms | 0.2227 ms | 8.400 ms |
| Fizzler | 3.217 ms | 0.0637 ms | 3.205 ms |
| Regex | 9.636 ms | 0.1904 ms | 9.456 ms |
As in the previous example, HtmlAgilityPack, AngleSharp, and Fizzler showed similar and very good times.
To my surprise, CsQuery and Regex showed equally poor processing times. With CsQuery the explanation is simple: it is just slow. With Regex it is less clear; most likely the problem can be solved in a more optimal way.
You have probably drawn your own conclusions already. However, I'd add that the best choice, for now, is AngleSharp, because it is under active development, has an intuitive API, and shows good processing times.
Does it make sense to switch to AngleSharp from HtmlAgilityPack? Probably not - you can use Fizzler and enjoy a speedy and convenient library.
The benchmark code can be found here.
I suggest continuing with the following links to learn more:
- HTML Parsing Libraries - C# - quick HtmlAgilityPack vs AngleSharp review
- HTML Parsing Libraries - Java - Java HTML parsing libraries overview
Happy Web Scraping, and don't forget to organize your HTML parsing selectors 📚