HTML parsing is a vital part of web scraping, as it allows converting web page content into meaningful, structured data. However, since HTML is a tree-structured format, it requires a proper parsing tool: it can't be properly traversed with Regex.
This article reviews the most popular .NET libraries for HTML parsing, along with their strengths and weaknesses.
HTML parsing libraries
Let's take a quick look at the libraries, their licenses, nuances, and so on.
HtmlAgilityPack
HtmlAgilityPack is one of the most (if not the most) famous HTML parsing libraries in the .NET world. As a result, many articles have been written about it.
In short, it is a fast, relatively handy library for working with HTML (assuming XPath queries are simple).
This parsing library will be convenient if the task is typical and well described by an XPath expression. For example, to get all the links from a page, we need very little code:
```csharp
public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);

    List<string> hrefTags = new List<string>();
    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}
```
Still, working with CSS classes is not convenient with this library and requires more complex XPath expressions:
```csharp
public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(html);

    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");

    List<string> hrefTags = new List<string>();
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }

    return hrefTags;
}
```
Among the oddities observed: a peculiar API that is sometimes unintuitive and confusing. However, the fact that the library is no longer abandoned is encouraging and makes it a real alternative to AngleSharp.
AngleSharp
AngleSharp is an HTML parser written from scratch in C#.
The API is based on the official JavaScript HTML DOM specification. There are quirks in some places that are unusual for .NET developers (e.g., accessing an invalid index in a collection will return null instead of throwing an exception; there is a separate URL class; namespaces are very granular), but generally nothing critical.
The library code is clean, neat, and user-friendly.
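To make the index quirk concrete, here is a minimal sketch. It assumes the same AngleSharp 0.9.x-style API used in this article's snippets (newer versions moved `HtmlParser` to `AngleSharp.Html.Parser` and renamed `Parse` to `ParseDocument`); the class and method names are illustrative only.

```csharp
using System;
using AngleSharp.Dom;          // DOM interfaces such as IElement
using AngleSharp.Parser.Html;  // HtmlParser - an example of the granular namespaces

public static class AngleSharpQuirksDemo
{
    public static void Run()
    {
        var parser = new HtmlParser();
        var document = parser.Parse("<a href='/docs'>Docs</a>");

        var links = document.QuerySelectorAll("a");

        // DOM-spec behavior noted above: an out-of-range index
        // yields null instead of throwing an exception.
        IElement missing = links[42];
        Console.WriteLine(missing == null); // True
    }
}
```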
For example, extracting links from a page looks almost the same as in the JavaScript and Python alternatives:
```csharp
public IEnumerable<string> AngleSharpParse()
{
    List<string> hrefTags = new List<string>();

    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }

    return hrefTags;
}
```
CsQuery
CsQuery is a jQuery port for .NET. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods.
It was one of the modern HTML parsers for .NET. The library was based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko (Firefox) engine.
Unfortunately, the project has been abandoned by its author. The recommended alternative is AngleSharp.
The code for getting links from a page looks nice and familiar to anyone who has used jQuery:
```csharp
public IEnumerable<string> CsQueryParse()
{
    List<string> hrefTags = new List<string>();

    CQ cq = CQ.Create(Html);
    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }

    return hrefTags;
}
```
Fizzler
Fizzler is an add-on built on top of HtmlAgilityPack that allows you to use CSS selectors.
Let's see what problem Fizzler solves, using the sample from its documentation:
```csharp
// Load the document using HtmlAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
    <html>
        <head></head>
        <body>
            <div>
                <p class='content'>Fizzler</p>
                <p>CSS Selector Engine</p>
            </div>
        </body>
    </html>");

// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode
var document = html.DocumentNode;

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");

// yields: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");

// yields empty sequence
document.QuerySelectorAll("body>p");

// yields: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");
```
It is almost as fast as HtmlAgilityPack but more convenient to use thanks to CSS selectors.
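For parity with the other libraries, link extraction with Fizzler could look like the following sketch. It assumes the Fizzler.Systems.HtmlAgilityPack package (which provides the `QuerySelectorAll` extension method on `HtmlNode`) and the same `Html` field as in the earlier snippets; the method name is illustrative.

```csharp
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack; // QuerySelectorAll extension for HtmlNode
using HtmlAgilityPack;

public IEnumerable<string> FizzlerParse()
{
    var document = new HtmlDocument();
    document.LoadHtml(Html);

    var hrefTags = new List<string>();

    // A CSS selector instead of the XPath used with plain HtmlAgilityPack.
    foreach (HtmlNode node in document.DocumentNode.QuerySelectorAll("a[href]"))
    {
        hrefTags.Add(node.GetAttributeValue("href", null));
    }

    return hrefTags;
}
```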
Regex
Regex is an ancient approach and not a good fit for parsing HTML. However, it can perform the task much faster than libraries that build a full DOM tree.
When it comes to regular expressions, you should understand that you cannot build a universal and absolutely reliable solution with them. However, if you only need to parse one specific site, this limitation may not be critical.
The code for getting links from the page still looks clear:
```csharp
public IEnumerable<string> Regex()
{
    List<string> hrefTags = new List<string>();

    Regex reHref = new Regex(@"(?inx)
        <a \s [^>]*
            href \s* = \s*
                (?<q> ['""] )
                    (?<url> [^""]+ )
                \k<q>
        [^>]* >");

    foreach (Match match in reHref.Matches(Html))
    {
        hrefTags.Add(match.Groups["url"].ToString());
    }

    return hrefTags;
}
```
If you suddenly want to parse tables with Regex, even ones in a fancy format, please look here first.
Benchmark
Parser speed is, after all, one of the most important attributes: HTML parsing speed determines how long a given task will take.
To measure parser performance I used the BenchmarkDotNet library from DreamWalker.
The measurements were made on an Intel® Core(TM) i9-9880H CPU @ 2.30GHz, but experience suggests that the relative times will be roughly the same on any other configuration.
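The harness follows the standard BenchmarkDotNet pattern: one `[Benchmark]` method per parser, run through `BenchmarkRunner`. A minimal sketch is below; the class name, file name, and the single HtmlAgilityPack method shown are illustrative assumptions, not the exact code from the repository.

```csharp
using System.Collections.Generic;
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

public class LinkExtractionBenchmark
{
    // The page is loaded once so the benchmark measures parsing only.
    private readonly string _html = File.ReadAllText("scrapingant.html");

    [Benchmark]
    public List<string> HtmlAgilityPack()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(_html);

        var hrefs = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes != null)
        {
            foreach (HtmlNode link in nodes)
            {
                hrefs.Add(link.GetAttributeValue("href", null));
            }
        }

        return hrefs;
    }

    // The AngleSharp, CsQuery, Fizzler, and Regex variants are added
    // the same way, one [Benchmark] method each.
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LinkExtractionBenchmark>();
}
```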
Regex is an excellent tool, but parsing HTML is not the job for it. As an experiment, however, I did try to implement a minimal working version. While it works perfectly, the amount of time I spent programming it suggests that I definitely wouldn't do it again.
Well, let's take a look at the benchmarks.
URL extraction from page links
This task seems basic for all parsers: more often than not, this is how an introduction to the world of parsers (and sometimes Regex) begins.
As a scraping example, I used the main page of ScrapingAnt.
The benchmark code can be found on Github, and there is a table with the results below:
| Method | Mean | Error | Median |
|---|---|---|---|
| HtmlAgilityPack | 3.653 ms | 0.087 ms | 3.579 ms |
| AngleSharp | 5.864 ms | 0.091 ms | 5.853 ms |
| CsQuery | 14.269 ms | 0.284 ms | 13.931 ms |
| Fizzler | 4.147 ms | 0.081 ms | 4.105 ms |
| Regex | 0.547 ms | 0.010 ms | 0.543 ms |
As expected, Regex was the fastest, but far from the most convenient. HtmlAgilityPack and Fizzler showed approximately the same processing time, slightly ahead of AngleSharp. CsQuery, unfortunately, was hopelessly behind.
Data extraction from HTML table
This task is very common among some visitors to our site, as we provide a constantly updated list of free proxies for web scraping in the public domain.
The code for all the libraries is about the same; the only difference is the API.
However, two things are worth mentioning: first, AngleSharp has specialized interfaces that make the task easier (see the sketch after the results below); second, Regex is not suitable for this task at all.
| Method | Mean | Error | Median |
|---|---|---|---|
| HtmlAgilityPack | 3.323 ms | 0.0947 ms | 3.317 ms |
| AngleSharp | 3.920 ms | 0.0557 ms | 3.929 ms |
| CsQuery | 8.475 ms | 0.2227 ms | 8.400 ms |
| Fizzler | 3.217 ms | 0.0637 ms | 3.205 ms |
| Regex | 9.636 ms | 0.1904 ms | 9.456 ms |
As in the previous example, HtmlAgilityPack, AngleSharp, and Fizzler showed about the same and very good times.
To my surprise, CsQuery and Regex showed equally poor processing times. With CsQuery everything is clear: it is simply slow. With Regex it is less obvious; most likely the task could be solved in a more optimal way.
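To illustrate the point about AngleSharp's specialized interfaces, here is a sketch of reading a table through the typed DOM. It assumes the AngleSharp 0.9.x namespaces used elsewhere in this article, a page whose first `<table>` is the one of interest, and the same `Html` field as in the earlier snippets; it is not the actual proxy-list parsing code.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom.Html;     // IHtmlTableElement, IHtmlTableRowElement (0.9.x naming)
using AngleSharp.Parser.Html;  // HtmlParser

public IEnumerable<string[]> AngleSharpTableParse()
{
    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    // The typed table interface exposes Rows and Cells directly,
    // so there is no need to hand-write selectors for tr/td.
    var table = document.QuerySelector("table") as IHtmlTableElement;
    if (table == null)
    {
        return Enumerable.Empty<string[]>();
    }

    var rows = new List<string[]>();
    foreach (IHtmlTableRowElement row in table.Rows)
    {
        rows.Add(row.Cells.Select(cell => cell.TextContent.Trim()).ToArray());
    }

    return rows;
}
```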
Conclusion
Everyone has probably drawn their own conclusions. Still, I'd add that the best choice right now is AngleSharp: it is under active development, has an intuitive API, and shows good processing times.
Does it make sense to switch from HtmlAgilityPack to AngleSharp? Probably not - you can use Fizzler and enjoy a fast and convenient library.
The benchmark code can be found here.
I suggest continuing with the following links to learn more:
- HTML Parsing Libraries - C# - quick HtmlAgilityPack vs AngleSharp review
- HTML Parsing Libraries - Java - Java HTML parsing libraries overview
- HTML Parsing Libraries - JavaScript - JavaScript HTML parsing libraries overview
Happy Web Scraping, and don't forget to organize your HTML parsing selectors 📚