HTML parsing is a vital part of web scraping, as it allows converting web page content into meaningful, structured data. However, since HTML is a tree-structured format, it requires a proper parsing tool: it can't be properly traversed with Regex.
This article reviews the most popular .NET libraries for HTML parsing, along with their strengths and weaknesses.
HTML parsing libraries
Let's take a quick look at the libraries, their licenses, nuances, and so on.
HtmlAgilityPack
HtmlAgilityPack is one of the most (if not the most) famous HTML parsing libraries in the .NET world. As a result, many articles have been written about it.
In short, it is a fast, relatively handy library for working with HTML (assuming XPath queries are simple).
This parsing library will be convenient if the task is typical and well described by an XPath expression. For example, to get all the links from a page, we need very little code:
```csharp
public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);

    List<string> hrefTags = new List<string>();
    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}
```
Still, working with CSS classes is not convenient with this library and requires more complex XPath expressions:
```csharp
public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(html);

    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");

    List<string> hrefTags = new List<string>();
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }

    return hrefTags;
}
```
Among the oddities observed: a peculiar API that is sometimes unintuitive and confusing. However, the fact that the library is no longer abandoned is encouraging and makes it a real alternative to AngleSharp.
AngleSharp
AngleSharp is an HTML parser written from scratch in C#.
The API is based on the official JavaScript HTML DOM specification. There are quirks in some places that are unusual for .NET developers (e.g., accessing an invalid index in a collection will return null instead of throwing an exception; there is a separate URL class; namespaces are very granular), but generally nothing critical.
The library code is clean, neat, and user-friendly.
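To make the index quirk concrete, here is a minimal sketch. It assumes the same AngleSharp 0.9.x-style API used in this article's snippets (newer versions moved `HtmlParser` to `AngleSharp.Html.Parser` and renamed `Parse` to `ParseDocument`); the class and method names are illustrative only.

```csharp
using System;
using AngleSharp.Dom;          // DOM interfaces such as IElement
using AngleSharp.Parser.Html;  // HtmlParser - an example of the granular namespaces

public static class AngleSharpQuirksDemo
{
    public static void Run()
    {
        var parser = new HtmlParser();
        var document = parser.Parse("<a href='/docs'>Docs</a>");

        var links = document.QuerySelectorAll("a");

        // DOM-spec behavior noted above: an out-of-range index
        // yields null instead of throwing an exception.
        IElement missing = links[42];
        Console.WriteLine(missing == null); // True
    }
}
```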
For example, extracting links from a page looks almost the same as in the JavaScript and Python alternatives:
```csharp
public IEnumerable<string> AngleSharpParse()
{
    List<string> hrefTags = new List<string>();

    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }

    return hrefTags;
}
```
CsQuery
CsQuery is a jQuery port for .NET. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods.
It was one of the modern HTML parsers for .NET. The library was based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko (Firefox) engine.
Unfortunately, the project has been abandoned by its author. The recommended alternative is AngleSharp.
The code for getting links from a page looks nice and familiar to anyone who has used jQuery:
```csharp
public IEnumerable<string> CsQueryParse()
{
    List<string> hrefTags = new List<string>();

    CQ cq = CQ.Create(Html);
    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }

    return hrefTags;
}
```
Fizzler
Fizzler is an add-on built on top of HtmlAgilityPack that allows you to use CSS selectors.
Let's see what problem Fizzler solves, using the sample from its documentation:
```csharp
// Load the document using HtmlAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
    <html>
        <head></head>
        <body>
            <div>
                <p class='content'>Fizzler</p>
                <p>CSS Selector Engine</p>
            </div>
        </body>
    </html>");

// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode
var document = html.DocumentNode;

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");

// yields: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");

// yields empty sequence
document.QuerySelectorAll("body>p");

// yields: [<p class="content">Fizzler</p>, <p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");
```
It is almost as fast as HtmlAgilityPack but more convenient to use thanks to CSS selectors.
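For parity with the other libraries, link extraction with Fizzler could look like the following sketch. It assumes the Fizzler.Systems.HtmlAgilityPack package (which provides the `QuerySelectorAll` extension method on `HtmlNode`) and the same `Html` field as in the earlier snippets; the method name is illustrative.

```csharp
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack; // QuerySelectorAll extension for HtmlNode
using HtmlAgilityPack;

public IEnumerable<string> FizzlerParse()
{
    var document = new HtmlDocument();
    document.LoadHtml(Html);

    var hrefTags = new List<string>();

    // A CSS selector instead of the XPath used with plain HtmlAgilityPack.
    foreach (HtmlNode node in document.DocumentNode.QuerySelectorAll("a[href]"))
    {
        hrefTags.Add(node.GetAttributeValue("href", null));
    }

    return hrefTags;
}
```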
Regex
Regex is an ancient approach and not a good fit for parsing HTML. However, it can perform the task much faster than libraries that build a full DOM tree.
When it comes to regular expressions, you should understand that you cannot build a universal and absolutely reliable solution with them. However, if you only need to parse one specific site, this limitation may not be critical.
The code for getting links from the page still looks clear:
```csharp
public IEnumerable<string> Regex()
{
    List<string> hrefTags = new List<string>();

    Regex reHref = new Regex(@"(?inx)
        <a \s [^>]*
            href \s* = \s*
                (?<q> ['""] )
                    (?<url> [^""]+ )
                \k<q>
        [^>]* >");

    foreach (Match match in reHref.Matches(Html))
    {
        hrefTags.Add(match.Groups["url"].ToString());
    }

    return hrefTags;
}
```
If you suddenly want to parse tables with Regex, even ones in a fancy format, please look here first.
Benchmark
Parser speed is, after all, one of the most important attributes: HTML parsing speed determines how long a given task will take.
To measure parser performance I used the BenchmarkDotNet library from DreamWalker.
The measurements were made on an Intel® Core(TM) i9-9880H CPU @ 2.30GHz, but experience suggests that the relative times will be roughly the same on any other configuration.
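The harness follows the standard BenchmarkDotNet pattern: one `[Benchmark]` method per parser, run through `BenchmarkRunner`. A minimal sketch is below; the class name, file name, and the single HtmlAgilityPack method shown are illustrative assumptions, not the exact code from the repository.

```csharp
using System.Collections.Generic;
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

public class LinkExtractionBenchmark
{
    // The page is loaded once so the benchmark measures parsing only.
    private readonly string _html = File.ReadAllText("scrapingant.html");

    [Benchmark]
    public List<string> HtmlAgilityPack()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(_html);

        var hrefs = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes != null)
        {
            foreach (HtmlNode link in nodes)
            {
                hrefs.Add(link.GetAttributeValue("href", null));
            }
        }

        return hrefs;
    }

    // The AngleSharp, CsQuery, Fizzler, and Regex variants are added
    // the same way, one [Benchmark] method each.
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LinkExtractionBenchmark>();
}
```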
Regex is an excellent tool, but parsing HTML is not the job for it. As an experiment, however, I did try to implement a minimal working version. While it works perfectly, the amount of time I spent programming it suggests that I definitely wouldn't do it again.
Well, let's take a look at the benchmarks.
URL extraction from page links
This task seems basic for all parsers: more often than not, this is how an introduction to the world of parsers (and sometimes Regex) begins.
As a scraping example, I used the main page of ScrapingAnt.
The benchmark code can be found on Github, and there is a table with the results below:
| Method | Mean | Error | Median |
|---|---|---|---|
| HtmlAgilityPack | 3.653 ms | 0.087 ms | 3.579 ms |
| AngleSharp | 5.864 ms | 0.091 ms | 5.853 ms |
| CsQuery | 14.269 ms | 0.284 ms | 13.931 ms |
| Fizzler | 4.147 ms | 0.081 ms | 4.105 ms |
| Regex | 0.547 ms | 0.010 ms | 0.543 ms |
As expected, Regex was the fastest, but far from the most convenient. HtmlAgilityPack and Fizzler showed approximately the same processing time, slightly ahead of AngleSharp. CsQuery, unfortunately, was hopelessly behind.
Data extraction from HTML table
This task is very common among some visitors to our site, as we provide a constantly updated list of free proxies for web scraping in the public domain.
The code for all the libraries is about the same; the only difference is the API.
However, two things are worth mentioning: first, AngleSharp has specialized interfaces that make the task easier (see the sketch after the results below); second, Regex is not suitable for this task at all.
| Method | Mean | Error | Median |
|---|---|---|---|
| HtmlAgilityPack | 3.323 ms | 0.0947 ms | 3.317 ms |
| AngleSharp | 3.920 ms | 0.0557 ms | 3.929 ms |
| CsQuery | 8.475 ms | 0.2227 ms | 8.400 ms |
| Fizzler | 3.217 ms | 0.0637 ms | 3.205 ms |
| Regex | 9.636 ms | 0.1904 ms | 9.456 ms |
As in the previous example, HtmlAgilityPack, AngleSharp, and Fizzler showed about the same and very good times.
To my surprise, CsQuery and Regex showed equally poor processing times. With CsQuery everything is clear: it is simply slow. With Regex it is less obvious; most likely the task could be solved in a more optimal way.
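To illustrate the point about AngleSharp's specialized interfaces, here is a sketch of reading a table through the typed DOM. It assumes the AngleSharp 0.9.x namespaces used elsewhere in this article, a page whose first `<table>` is the one of interest, and the same `Html` field as in the earlier snippets; it is not the actual proxy-list parsing code.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom.Html;     // IHtmlTableElement, IHtmlTableRowElement (0.9.x naming)
using AngleSharp.Parser.Html;  // HtmlParser

public IEnumerable<string[]> AngleSharpTableParse()
{
    var parser = new HtmlParser();
    var document = parser.Parse(Html);

    // The typed table interface exposes Rows and Cells directly,
    // so there is no need to hand-write selectors for tr/td.
    var table = document.QuerySelector("table") as IHtmlTableElement;
    if (table == null)
    {
        return Enumerable.Empty<string[]>();
    }

    var rows = new List<string[]>();
    foreach (IHtmlTableRowElement row in table.Rows)
    {
        rows.Add(row.Cells.Select(cell => cell.TextContent.Trim()).ToArray());
    }

    return rows;
}
```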
Conclusion
Everyone has probably drawn their own conclusions. Still, I'd add that the best choice right now is AngleSharp: it is under active development, has an intuitive API, and shows good processing times.
Does it make sense to switch from HtmlAgilityPack to AngleSharp? Probably not - you can use Fizzler and enjoy a fast and convenient library.
The benchmark code can be found here.
I suggest continuing with the following links to learn more:
- HTML Parsing Libraries - C# - quick HtmlAgilityPack vs AngleSharp review
- HTML Parsing Libraries - Java - Java HTML parsing libraries overview
- HTML Parsing Libraries - JavaScript - JavaScript HTML parsing libraries overview
Happy Web Scraping, and don't forget to organize your HTML parsing selectors 📚