Websites are written in HTML, which means that each web page is a structured document. Sometimes the goal is to extract data from those pages while preserving that structure. Websites don't always provide their data in convenient formats such as CSV or JSON, so often the only way to get it is to parse the HTML page.
HTML is a structurally simple markup language, and anyone who writes a web scraper has to deal with HTML parsing. The goal of this article is to help you find the right tool for HTML processing in C#.
AngleSharp follows the W3C specifications and gives you the same results as state-of-the-art browsers. Besides the official API, AngleSharp adds some useful extension methods on top, which makes working with the DOM convenient.
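To illustrate what working with the DOM looks like, here is a minimal sketch that parses a string and queries it with a CSS selector. The markup, the class name `item`, and the `QueryExample` class are made up for illustration, and the code assumes the AngleSharp NuGet package is referenced.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

class QueryExample
{
    static async Task Main()
    {
        // Hypothetical markup, used only for illustration.
        var source = "<ul><li class=\"item\">First</li>"
                   + "<li class=\"item\">Second</li><li>Other</li></ul>";

        // Create a browsing context with the default configuration
        // and load the string as a virtual response.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(source));

        // Standard DOM query with a CSS selector, combined with LINQ.
        var texts = document.QuerySelectorAll("li.item")
                            .Select(m => m.TextContent);

        // Prints "First" and "Second", one per line.
        foreach (var text in texts)
        {
            Console.WriteLine(text);
        }
    }
}
```

The same selector would work in a browser's `document.querySelectorAll`, which is the point of a standards-driven parser.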
AngleSharp is quite simply the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most used companions: CSS and SVG. The main advantages of using AngleSharp are as follows:
- Performance: AngleSharp is fast enough to parse your favorite websites in practically no time.
- Standard-Driven: Everything works just like in modern browsers, from DOM construction to serialization.
- Interactive DOM: The DOM exposed by AngleSharp is fully functional and interactive, so you will even be able to handle DOM events in your code.
- Great Documentation: The whole codebase is documented with XML documentation, and you can also find the web version here: https://anglesharp.github.io/docs.html
AngleSharp constructs a DOM according to the official HTML5 specification. This also means that the resulting model is fully interactive and can be used for simple manipulation. For example, you can create a document (such as a virtual document, which you can easily get using our Web Scraping API) and change its tree structure by inserting another paragraph element with some text.
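A minimal sketch of that manipulation could look as follows. The source markup and the `ManipulationExample` class are placeholders, and the code assumes the AngleSharp NuGet package is referenced.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

class ManipulationExample
{
    static async Task Main()
    {
        // Placeholder markup standing in for a downloaded page.
        var source = "<h1>Some example source</h1>"
                   + "<p>This is a paragraph element</p>";

        // Build the document from a virtual response.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(source));

        // Serialize the tree before the change.
        Console.WriteLine(document.DocumentElement.OuterHtml);

        // Create a new paragraph and append it to the body.
        var paragraph = document.CreateElement("p");
        paragraph.TextContent = "This is another paragraph.";
        document.Body.AppendChild(paragraph);

        // Serialize again; the new paragraph is now part of the tree.
        Console.WriteLine(document.DocumentElement.OuterHtml);
    }
}
```

Because the parser follows the HTML5 tree-construction rules, the serialized output also contains the `html`, `head`, and `body` elements that the source string left out.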
Check out more info at the official AngleSharp website: https://anglesharp.github.io/
HtmlAgilityPack is an agile HTML parser written in C# that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry).
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
In terms of features and quality it is quite lacking, at least compared to AngleSharp. At the time of writing, support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are listed on the roadmap in the project README. The same README also plans a cleanup of the code (https://github.com/zzzprojects/html-agility-pack).
So, in my opinion, HtmlAgilityPack is not a great option for starting a new project, but it's fine to keep using the library in existing code and to receive updates from its maintainers.
Using the library is quite straightforward: you load the HTML into a document and then query its nodes with XPath.
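As a minimal sketch of that workflow, the snippet below loads an HTML string and reads a heading and all link targets. The markup, the URLs, and the `HapExample` class are invented for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class HapExample
{
    static void Main()
    {
        // Hypothetical markup, used only for illustration.
        var html = "<html><body><h1>Title</h1>"
                 + "<a href=\"https://example.com/a\">A</a>"
                 + "<a href=\"https://example.com/b\">B</a>"
                 + "</body></html>";

        // Parse the string into a read/write DOM.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select a single node with XPath; the result is null if nothing matches.
        var heading = doc.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine(heading?.InnerText);

        // SelectNodes also returns null (not an empty collection) on no match.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (var link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }
}
```

The null checks matter in practice: a page whose layout changed will make the XPath queries return null rather than throw at the query itself.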
Also, check out the online in-browser examples from the official website: https://html-agility-pack.net/online-examples
We have looked at a couple of libraries, and you might be surprised that, despite the popularity of HTML, there are usually few mature choices. That is because, while HTML is very popular and structurally simple, supporting all of the related standards is hard work.