HTML Parsing Libraries: C#

AngleSharp and HtmlAgilityPack: A Guide to Selecting the Right Library

Websites are written in HTML, which means that each web page is a structured document. Sometimes the goal is to extract data from those pages while preserving their structure. Websites don’t always provide their data in convenient formats such as CSV or JSON, so the only way to get at it is to parse the HTML page.

HTML is a simple structured markup language, and anyone who writes a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing in C#.

AngleSharp

AngleSharp follows the W3C specifications and gives you the same results as state-of-the-art browsers. On top of the official API, AngleSharp adds some useful extension methods, which make working with the DOM convenient.

https://anglesharp.github.io/ – Official AngleSharp website

AngleSharp is quite simply the default choice whenever you need a modern HTML parser for a C# project. In fact, it parses not just HTML5 but also its most used companions, CSS and SVG. The main advantages of using AngleSharp:

  • Performance: AngleSharp is fast enough to parse your favorite websites in practically no time.
  • Standard-Driven: Everything works just like in modern browsers, from DOM construction to serialization.
  • Interactive DOM: The DOM exposed by AngleSharp is fully functional and interactive, so you can even handle DOM events in your code.
  • Great Documentation: The whole code base is documented with XML documentation, and a web version is available here: https://anglesharp.github.io/docs.html
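
To make the "works just like in modern browsers" point concrete, here is a minimal sketch of querying a parsed document with a CSS selector, much as you would with querySelectorAll in a browser. It assumes the AngleSharp NuGet package is installed; the class name, the helper method and the sample markup are hypothetical and exist only for illustration.

```csharp
// Requires the AngleSharp NuGet package
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class SelectorSample
{
    // Returns the trimmed text of every element matched by the CSS selector
    public static List<string> ExtractTexts(string html, string selector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        return document.QuerySelectorAll(selector)
            .Select(element => element.TextContent.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration
        var html = "<ul class='items'><li>First</li><li>Second</li></ul><p>Other</p>";

        // Only the list items match, so the paragraph is skipped
        foreach (var text in ExtractTexts(html, "ul.items > li"))
        {
            Console.WriteLine(text);
        }
    }
}
```

The helper takes the selector as a parameter, so the same method can be reused for different pages without touching the parsing code.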

There is also an extension that integrates scripting into the context of parsing HTML documents: both C# and JavaScript, with the JavaScript support based on Jint. You can check it out by following the link: https://github.com/AngleSharp/AngleSharp.Js

This means that you can parse HTML documents after they have been modified by JavaScript. This applies both to the JavaScript included in the page and to scripts you add yourself.
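
As a rough sketch of what that looks like in practice, the snippet below opens a small page in a scripting-enabled context and serializes the DOM afterwards. It assumes that both the AngleSharp and AngleSharp.Js NuGet packages are installed and that the WithJs() configuration extension from AngleSharp.Js is available in your version; the class name, the helper method and the markup are hypothetical.

```csharp
// Requires the AngleSharp and AngleSharp.Js NuGet packages
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Js;

public static class ScriptingSample
{
    // Opens the markup in a scripting-enabled context and returns the serialized DOM
    public static async Task<string> RenderAsync(string html)
    {
        // WithJs() comes from AngleSharp.Js and enables the Jint-based engine
        var config = Configuration.Default.WithJs();
        var context = BrowsingContext.New(config);

        var document = await context.OpenAsync(req => req.Content(html));
        return document.DocumentElement.OuterHtml;
    }

    public static async Task Main()
    {
        // Hypothetical page whose inline script changes the title
        var html = @"<!doctype html>
<html>
<head><title>Sample</title></head>
<body>
<script>
document.title = 'Changed ' + 'by script';
</script>
</body>
</html>";

        Console.WriteLine(await RenderAsync(html));
    }
}
```

Whether a given script runs exactly as it would in a browser depends on the package versions and on which DOM APIs the script uses, so treat this as a starting point rather than a guarantee.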

AngleSharp constructs a DOM according to the official HTML5 specification. This also means that the resulting model is fully interactive and can be used for simple manipulation. The following example creates a document (a virtual one here; a real one is easy to obtain with our Scraping API) and changes the tree structure by inserting another paragraph element with some text.


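// Requires the AngleSharp NuGet package
using System;
using System.Threading.Tasks;
using AngleSharp;

static async Task FirstExample()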
{
    //Use the default configuration for AngleSharp
    var config = Configuration.Default;

    //Create a new context for evaluating webpages with the given config
    var context = BrowsingContext.New(config);

    //Parse the document from the content of a response to a virtual request
    var document = await context.OpenAsync(req => req.Content("<h1>Scraped HTML from ScrapingAnt</h1><p>This is a paragraph element"));

    //Do something with document like the following
    Console.WriteLine("Serializing the (scraped) document:");
    Console.WriteLine(document.DocumentElement.OuterHtml);

    var p = document.CreateElement("p");
    p.TextContent = "This is another paragraph added to the scraped document.";

    Console.WriteLine("Inserting another element in the body ...");
    document.Body.AppendChild(p);

    Console.WriteLine("Serializing the scraped document again:");
    Console.WriteLine(document.DocumentElement.OuterHtml);
}

Check out more info at the official AngleSharp website: https://anglesharp.github.io/

HtmlAgilityPack

HtmlAgilityPack is an HTML parser written in C# that builds a read/write DOM and supports plain XPath or XSLT.

https://html-agility-pack.net/ – Official HtmlAgilityPack website

You don’t actually have to understand XPath or XSLT to use it, so don’t worry. It is a .NET code library that lets you parse “out of the web” HTML files, and the parser is very tolerant of “real world” malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
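
The XPath support and the tolerance for malformed markup can be sketched together as follows. This is a minimal example that assumes the HtmlAgilityPack NuGet package is installed; the class name, the helper method and the deliberately broken markup are hypothetical.

```csharp
// Requires the HtmlAgilityPack NuGet package
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSample
{
    // Returns the href value of every anchor that has one
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return links;
    }

    public static void Main()
    {
        // Hypothetical, deliberately malformed markup: the <div> and <p> elements are never closed
        var html = "<div><p>See <a href='/docs'>the docs</a><p>or <a href='/faq'>the FAQ</a>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

The null check matters in real scrapers: a page without any matching node would otherwise cause a NullReferenceException in the loop.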

In terms of features and quality it is somewhat lacking, at least compared to AngleSharp. Support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are, at the time of writing, on the roadmap. The same README document also plans a cleanup of the code (https://github.com/zzzprojects/html-agility-pack).

So, in my opinion, HtmlAgilityPack is not the best option to start a new project with, but it is fine to keep using it in an existing one and to receive updates from its maintainers.

The library is quite straightforward to use; here is a sample:


// @nuget: HtmlAgilityPack

using System;
using HtmlAgilityPack;
using System.Collections.Generic;

public class Program
{
	public static void Main()
	{
		// Load
		var doc = new HtmlDocument();
		doc.LoadHtml(@"<html><body><div id='foo'>I love ScrapingAnt <3</div></body></html>");
		var div = doc.GetElementbyId("foo"); 
		
		// Print the element's markup to the console
		Console.WriteLine(div.OuterHtml);
		
		// FiddleHelper is only available on .NET Fiddle;
		// remove the two calls below when running the sample elsewhere
		FiddleHelper.WriteTable(new List<string> { div.OuterHtml });
		FiddleHelper.WriteTable(new List<HtmlNode> { div });
	}
}

Also, check out the online in-browser examples from the official website: https://html-agility-pack.net/online-examples

Conclusion

We have looked at a couple of libraries, and you might be surprised that, despite the popularity of HTML, there are usually few mature choices. That is because, while HTML is very popular and structurally simple, supporting all of its many standards is hard work.

There might not be as many choices as the ones we've mentioned for JavaScript, but there is always at least one good choice to work with. And if you'd like to see how you can easily get the HTML to parse, please try out our full-featured Web Scraping API.
