
HTML Parsing Libraries - C#

· 5 min read
Oleg Kulyk


Websites are written in HTML, which means that each web page is a structured document. Sometimes the goal is to extract data from those pages while preserving that structure. Websites don't always provide their data in convenient formats such as CSV or JSON, so the only way to get at it is to parse the HTML page.

AngleSharp and HtmlAgilityPack: A Guide to Select the Right Library

HTML is a structurally simple markup language, and anyone who writes a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing in C#.

AngleSharp

AngleSharp follows the W3C specifications and gives you the same results as state-of-the-art browsers. Besides the official API, AngleSharp adds some useful extension methods on top, which makes working with the DOM convenient.

AngleSharp is quite simply the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most used companions: CSS and SVG. The main advantages of using AngleSharp are as follows:

  • Performance: AngleSharp is fast enough to parse your favorite websites in practically no time.
  • Standard-Driven: Everything works just like in modern browsers, from DOM construction to serialization.
  • Interactive DOM: The DOM exposed by AngleSharp is fully functional and interactive, so you can even handle DOM events in your code.
  • Great Documentation: The whole code base is documented with XML documentation, and you can also find the web version here: https://anglesharp.github.io/docs.html
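
To make the "Standard-Driven" point from the list above more concrete, here is a minimal sketch that queries a parsed document with a CSS selector, the same way you would call querySelectorAll in a browser. The markup, the SelectorSketch class and the SelectTexts helper are made up for this illustration; only HtmlParser, ParseDocument and QuerySelectorAll come from AngleSharp itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class SelectorSketch
{
    // Hypothetical helper: returns the trimmed text of every element
    // that matches the given CSS selector.
    public static List<string> SelectTexts(string html, string cssSelector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        return document.QuerySelectorAll(cssSelector)
                       .Select(element => element.TextContent.Trim())
                       .ToList();
    }

    public static void Main()
    {
        // Sample markup invented for this example
        const string html = @"<ul>
  <li class='product'>Keyboard</li>
  <li class='product'>Mouse</li>
  <li class='note'>Prices exclude tax</li>
</ul>";

        // Only the two <li class='product'> items match the selector
        foreach (var name in SelectTexts(html, "li.product"))
        {
            Console.WriteLine(name);
        }
    }
}
```

With the sample markup above, this should print Keyboard and Mouse on separate lines, since the third list item does not carry the product class.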

There is also an extension to integrate scripting, both C# and JavaScript (based on Jint), in the context of parsing HTML documents. You can check it out by following the link: https://github.com/AngleSharp/AngleSharp.Js

This means that you can parse HTML documents after they have been modified by JavaScript, either by the JavaScript included in the page or by a script you add yourself.
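
As a rough sketch of what that looks like in practice, the following example loads a small page whose inline script changes the document title and then reads the title back. It assumes the AngleSharp.Js NuGet package is installed and that its WithJs() configuration extension is visible through the AngleSharp namespace; the ScriptingSketch class, the TitleAfterScriptsAsync helper and the markup are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

public static class ScriptingSketch
{
    // Hypothetical helper: opens the markup with scripting enabled
    // and returns the document title after inline scripts have run.
    public static async Task<string> TitleAfterScriptsAsync(string html)
    {
        // WithJs() is assumed to come from the AngleSharp.Js package
        var config = Configuration.Default.WithJs();
        var context = BrowsingContext.New(config);

        var document = await context.OpenAsync(req => req.Content(html));
        return document.Title;
    }

    public static async Task Main()
    {
        // Sample page invented for this example
        const string html = @"<!doctype html>
<html>
<head><title>Original title</title></head>
<body>
<script>document.title = 'Title set by script';</script>
</body>
</html>";

        Console.WriteLine(await TitleAfterScriptsAsync(html));
    }
}
```

Without the WithJs() call the script element is parsed but not executed, so comparing the two configurations is a quick way to check whether scripting is active in your setup.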

AngleSharp constructs a DOM according to the official HTML5 specification. This also means that the resulting model is fully interactive and can be used for simple manipulation. The following example creates a document (a virtual one here, like the HTML you can easily get by using our Web Scraping API) and changes the tree structure by inserting another paragraph element with some text.

using System;
using System.Threading.Tasks;
using AngleSharp;

public static class Program
{
    public static async Task Main()
    {
        await FirstExample();
    }

    static async Task FirstExample()
    {
        // Use the default configuration for AngleSharp
        var config = Configuration.Default;

        // Create a new context for evaluating webpages with the given config
        var context = BrowsingContext.New(config);

        // Parse the document from the content of a response to a virtual request
        var document = await context.OpenAsync(req => req.Content("<h1>Scraped HTML from ScrapingAnt</h1><p>This is a paragraph element</p>"));

        // Serialize the document as it was parsed
        Console.WriteLine("Serializing the (scraped) document:");
        Console.WriteLine(document.DocumentElement.OuterHtml);

        // Create a new paragraph element that belongs to this document
        var p = document.CreateElement("p");
        p.TextContent = "This is another paragraph added to the scraped document.";

        Console.WriteLine("Inserting another element in the body ...");
        document.Body.AppendChild(p);

        // Serialize again to see the inserted paragraph
        Console.WriteLine("Serializing the scraped document again:");
        Console.WriteLine(document.DocumentElement.OuterHtml);
    }
}

Check out more info at the official AngleSharp website: https://anglesharp.github.io/

HtmlAgilityPack

HtmlAgilityPack is an agile HTML parser written in C# that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
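
To illustrate the XPath support and the tolerance for malformed markup, here is a small sketch that collects the href of every link in a fragment whose list items are never closed. The XPathSketch class, the ExtractLinks helper and the sample markup are made up for this example; LoadHtml, SelectNodes and GetAttributeValue are HtmlAgilityPack calls.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Hypothetical helper: returns the href value of every <a> element
    // that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return links;
    }

    public static void Main()
    {
        // Sample markup invented for this example; the <li> tags are left unclosed on purpose
        const string html = "<ul><li><a href='/docs'>Docs</a><li><a href='/blog'>Blog</a></ul>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

The null check matters in real scrapers: a page without any matching nodes would otherwise throw a NullReferenceException in the loop.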

In terms of features and quality it is quite lacking, at least compared to AngleSharp. At the time of writing, support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, were listed on the roadmap. The same README document also planned a cleanup of the code (https://github.com/zzzprojects/html-agility-pack).

So, in my opinion, HtmlAgilityPack is not the best option to start a new project with, but it's fine to keep working with this library in an existing project and receive updates from its contributors.

Using the library is quite straightforward, as the following sample shows:

// @nuget: HtmlAgilityPack

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Load the HTML from a string and look up the element by its id
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><body><div id='foo'>I love ScrapingAnt <3</div></body></html>");
        var div = doc.GetElementbyId("foo");

        // Print the element's markup to the console
        Console.WriteLine(div.OuterHtml);

        // FiddleHelper is available only on .NET Fiddle (dotnetfiddle.net);
        // remove the two calls below when running the sample anywhere else.

        // Show the markup as a table
        FiddleHelper.WriteTable(new List<string> { div.OuterHtml });

        // Show the node's properties as a table
        FiddleHelper.WriteTable(new List<HtmlNode> { div });
    }
}

Also, check out the online in-browser examples from the official website: https://html-agility-pack.net/online-examples

Conclusion

We have checked a couple of libraries, and you might be surprised that, despite the popularity of HTML, there are usually few mature choices. That is because, while HTML is very popular and structurally simple, supporting all of the related standards is hard work.

While there might not be as many choices as there are for JavaScript, there is always at least one good choice to work with. And if you'd like to check how you can easily get the HTML to parse, please try out our full-featured Web Scraping API.
