
How to parse HTML in .NET

· 8 min read
Oleg Kulyk


HTML parsing is a vital part of web scraping, as it allows converting web page content into meaningful, structured data. Still, since HTML is a tree-structured format, it requires a proper parsing tool: it can't be properly traversed using Regex.

This article covers the most popular .NET libraries for HTML parsing, along with their strengths and weaknesses.

HTML parsing libraries

Let's have a quick review of the libraries with their licenses, nuances, etc.

HtmlAgilityPack

HtmlAgilityPack is one of the most (if not the most) famous HTML parsing libraries in the .NET world. As a result, many articles have been written about it.

In short, it is a fast, relatively handy library for working with HTML (assuming XPath queries are simple).

MIT License.

This parsing library will be convenient if the task is typical and well described by an XPath expression. For example, to get all the links from a page, we need very little code:

public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);

    List<string> hrefTags = new List<string>();

    // Note: SelectNodes returns null (not an empty collection) when nothing matches
    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}

Still, working with CSS classes is not convenient in this library and requires more complex XPath expressions:

public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(html);
    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");

    List<string> hrefTags = new List<string>();

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }

    return hrefTags;
}

Among the observed oddities is a rather specific API that is sometimes incomprehensible and confusing. However, the fact that the library is no longer abandoned is encouraging and makes it a real alternative to AngleSharp.

AngleSharp

AngleSharp is written from scratch using C#.

The API is based on the official JavaScript HTML DOM specification. There are quirks in some places that are unusual for .NET developers (e.g., accessing an invalid index in a collection will return null instead of throwing an exception; there is a separate URL class; namespaces are very granular), but generally nothing critical.
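
A minimal sketch of the index quirk (assuming the pre-0.10 API with HtmlParser.Parse, as in the sample further below; the HTML snippet is purely illustrative):

// using AngleSharp.Parser.Html; (AngleSharp.Html.Parser and ParseDocument in 0.10+)
var parser = new HtmlParser();
var document = parser.Parse("<p>first</p><p>second</p>");

// Per the DOM specification, an out-of-range index returns null
// instead of throwing an exception
var missing = document.QuerySelectorAll("p")[10]; // null

// AngleSharp also ships its own Url class (in the AngleSharp namespace)
// rather than reusing System.Uri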

MIT License.

The library code is clean, neat, and user-friendly.

For example, extracting links from a page looks almost no different from the JavaScript and Python alternatives:

public IEnumerable<string> AngleSharpParse()
{
    List<string> hrefTags = new List<string>();

    var parser = new HtmlParser();
    // In AngleSharp 0.10+ this method is called ParseDocument
    var document = parser.Parse(Html);
    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }

    return hrefTags;
}

CsQuery

CsQuery is a jQuery port for .NET. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods.

It was once one of the most modern HTML parsers for .NET. The library was based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko (Firefox) engine.

MIT License.

Unfortunately, the project has been abandoned by its author. The recommended alternative is AngleSharp.

The code for getting links from a page looks nice and familiar to anyone who has used jQuery:

public IEnumerable<string> CsQueryParse()
{
    List<string> hrefTags = new List<string>();

    CQ cq = CQ.Create(Html);
    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }

    return hrefTags;
}

Fizzler

Fizzler is an add-on to HtmlAgilityPack (Fizzler's implementation is built on top of HtmlAgilityPack) that allows you to use CSS selectors.

GNU GPL license.

Let's see what problem Fizzler solves using the sample from its documentation:

// Load the document using HTMLAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
<html>
<head></head>
<body>
<div>
<p class='content'>Fizzler</p>
<p>CSS Selector Engine</p></div>
</body>
</html>");

// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode

var document = html.DocumentNode;

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");

// yields: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");

// yields empty sequence
document.QuerySelectorAll("body>p");

// yields [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");

// yields [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

It is almost as fast as HtmlAgilityPack, but more convenient thanks to CSS selectors.
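
For comparison with the snippets above, here is a minimal sketch of the same link-extraction task using Fizzler's QuerySelectorAll extension (assuming the Fizzler.Systems.HtmlAgilityPack package and the same Html property as in the earlier examples):

public IEnumerable<string> FizzlerParse()
{
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(Html);

    List<string> hrefTags = new List<string>();

    // A CSS selector instead of the XPath expression used with plain HtmlAgilityPack
    foreach (HtmlNode node in htmlDocument.DocumentNode.QuerySelectorAll("a[href]"))
    {
        hrefTags.Add(node.GetAttributeValue("href", null));
    }

    return hrefTags;
}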

Regex

Regex is an ancient approach and not a good fit for parsing HTML. However, it allows you to perform the task much faster than libraries that build a full DOM tree.

When it comes to regular expressions, you should understand that you cannot build a universal and absolutely reliable solution with them. However, if you only need to parse a specific site, this problem may not be critical.

Regex is part of the .NET base class library (System.Text.RegularExpressions), so no third-party package is required.

The code for getting links from the page still looks clear:

public IEnumerable<string> Regex()
{
    List<string> hrefTags = new List<string>();

    Regex reHref = new Regex(@"(?inx)
        <a \s [^>]*
            href \s* = \s*
                (?<q> ['""] )
                    (?<url> [^""]+ )
                \k<q>
        [^>]* >");

    foreach (Match match in reHref.Matches(Html))
    {
        hrefTags.Add(match.Groups["url"].ToString());
    }

    return hrefTags;
}

If you suddenly want to parse tables with Regex, especially ones in a fancy format, please look here first.

Benchmark

Parser speed is, after all, one of the most important attributes. HTML parsing speed determines how long it will take you to finish a given task.

To measure parser performance, I used the BenchmarkDotNet library by DreamWalker.
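
For reference, a minimal BenchmarkDotNet harness for the link-extraction task could look roughly like this (the class name, file path, and method bodies are illustrative, not the exact code from the benchmark repository):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using AngleSharp.Parser.Html;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

[MemoryDiagnoser]
public class LinkExtractionBenchmark
{
    private string _html;

    [GlobalSetup]
    public void Setup() => _html = File.ReadAllText("page.html"); // placeholder path to the saved page

    [Benchmark(Baseline = true)]
    public List<string> HtmlAgilityPackLinks()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(_html);
        // Assumes the page under test actually contains links (SelectNodes returns null otherwise)
        return doc.DocumentNode
            .SelectNodes("//a[@href]")
            .Select(n => n.GetAttributeValue("href", null))
            .ToList();
    }

    [Benchmark]
    public List<string> AngleSharpLinks()
    {
        // ParseDocument in AngleSharp 0.10+
        var document = new HtmlParser().Parse(_html);
        return document.QuerySelectorAll("a")
            .Select(e => e.GetAttribute("href"))
            .ToList();
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<LinkExtractionBenchmark>();
}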

The measurements were made on an Intel® Core(TM) i9-9880H CPU @ 2.30GHz, but experience tells us that the relative time will be the same on any other configuration.

Note

Regex is an excellent tool, but parsing HTML is not the task it is meant for. As an experiment, however, I did try to implement a minimal working version of the code. While it works perfectly, the amount of time I spent writing it suggests that I definitely wouldn't do it again.

Well, let's take a look at the benchmarks.

This task seems basic for any parser: more often than not, this is how an introduction to the world of parsers (and sometimes Regex as well) begins.

As a scraping example, I've used the main page of ScrapingAnt.

The benchmark code can be found on GitHub, and there is a table with the results below:

| Method          | Mean      | Error    | Median    |
|-----------------|-----------|----------|-----------|
| HtmlAgilityPack | 3.653 ms  | 0.087 ms | 3.579 ms  |
| AngleSharp      | 5.864 ms  | 0.091 ms | 5.853 ms  |
| CsQuery         | 14.269 ms | 0.284 ms | 13.931 ms |
| Fizzler         | 4.147 ms  | 0.081 ms | 4.105 ms  |
| Regex           | 0.547 ms  | 0.010 ms | 0.543 ms  |

Generally, Regex was expectedly the fastest but far from the most comfortable. HtmlAgilityPack and Fizzler showed approximately the same processing time, slightly ahead of AngleSharp. CsQuery, unfortunately, was hopelessly behind.

Data extraction from HTML table

This task is very common among visitors to our site, as we provide a constantly updated public list of free proxies for web scraping.

The code for all the libraries is about the same; the only difference is the API.

However, there are two things worth mentioning: first, AngleSharp has specialized interfaces, which made the task easier. Second, Regex is not suitable for this task at all.
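
As an illustration of those specialized interfaces, a minimal sketch of table extraction with AngleSharp might look like this (the plain "table" selector and cell handling are assumptions, not the exact benchmark code):

public IEnumerable<string[]> AngleSharpTableParse()
{
    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    List<string[]> rows = new List<string[]>();

    // IHtmlTableElement lives in AngleSharp.Dom.Html (AngleSharp.Html.Dom in 0.10+)
    // and exposes Rows and Cells directly
    var table = document.QuerySelector("table") as IHtmlTableElement;
    if (table == null)
    {
        return rows;
    }

    foreach (var row in table.Rows)
    {
        var cells = new List<string>();
        foreach (var cell in row.Cells)
        {
            cells.Add(cell.TextContent.Trim());
        }
        rows.Add(cells.ToArray());
    }

    return rows;
}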

| Method          | Mean     | Error     | Median   |
|-----------------|----------|-----------|----------|
| HtmlAgilityPack | 3.323 ms | 0.0947 ms | 3.317 ms |
| AngleSharp      | 3.920 ms | 0.0557 ms | 3.929 ms |
| CsQuery         | 8.475 ms | 0.2227 ms | 8.400 ms |
| Fizzler         | 3.217 ms | 0.0637 ms | 3.205 ms |
| Regex           | 9.636 ms | 0.1904 ms | 9.456 ms |

As in the previous example, HtmlAgilityPack, AngleSharp, and Fizzler showed about the same and very good times.

To my surprise, CsQuery and Regex showed equally bad processing times. While everything is clear with CsQuery (it's just slow), the Regex result is less obvious: most likely the problem could be solved in a more optimal way.

Conclusion

Everyone has probably drawn their own conclusions. However, I'd add that the best choice for now is AngleSharp: it's under active development, has an intuitive API, and shows good processing times.

Does it make sense to switch from HtmlAgilityPack to AngleSharp? Probably not: you can use Fizzler and enjoy a speedy and convenient library.

The benchmark code can be found here.


Happy Web Scraping, and don't forget to organize your HTML parsing selectors 📚
