HTML is a simply structured markup language, and anyone who writes a web scraper has to deal with HTML parsing.
HTML is so popular that there is a better option than parsing it by hand: using a library. A library is easier to use and usually provides more features, such as a way to create an HTML document or to navigate easily through the parsed document. It usually comes with CSS/jQuery-like selectors to find nodes according to their position in the hierarchy.
The goal of this article is to help you find the right tool for HTML processing in Java.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
jsoup is a Java-based library with a long history, but a modern attitude:
- it can handle old and bad HTML, but it is also equipped for HTML5
- it has powerful support for manipulation, with support for CSS selectors, DOM traversal, and easy addition or removal of HTML
- it can clean HTML, both to protect against XSS attacks and in the sense that it improves structure and formatting
It provides a very convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
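To illustrate what browser-like parsing means in practice, here is a minimal sketch that feeds jsoup a deliberately sloppy fragment. It assumes jsoup is on the classpath; the class name `LenientParseSketch` and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseSketch {

    // Parses a string of HTML; jsoup adds the missing <html>, <head> and <body>.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // Hypothetical input: no <html>/<body> wrapper and two unclosed <p> tags.
        Document doc = parse("<title>Demo</title><p>First<p>Second");

        System.out.println(doc.title());            // text of the <title> element
        System.out.println(doc.select("p").size()); // number of <p> elements found
        System.out.println(doc.html());             // the normalized document
    }
}
```

The two unclosed paragraphs end up as sibling `<p>` elements inside an implied `<body>`, which is the structure a browser would build from the same input.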
The jsoup library provides the following functionality:
- Multiple Read Support − It reads and parses HTML from a URL, file, or string.
- CSS Selectors − It can find and extract data, using DOM traversal or CSS selectors.
- DOM Manipulation − It can manipulate the HTML elements, attributes, and text.
- Prevent XSS attacks − It can clean user-submitted content against a given safelist of allowed elements and attributes, to prevent XSS attacks.
- Tidy − It outputs tidy HTML.
- Handles invalid data − jsoup can handle unclosed tags and implicit tags, and can reliably create the document structure.
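Two of the features listed above, XSS cleaning and DOM manipulation, can be sketched roughly as follows. This assumes a jsoup version that ships `org.jsoup.safety.Safelist` (older releases call the same class `Whitelist`); the class name, method names, and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

class JsoupFeaturesSketch {

    // Keeps only the basic text-formatting tags and attributes that
    // Safelist.basic() permits; everything else is dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Finds links whose href starts with "http" via a CSS selector and
    // edits their attributes in place.
    static String markExternalLinks(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element link : doc.select("a[href^=http]")) {
            link.attr("target", "_blank").addClass("external");
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String untrusted = "<p>Hello <a href=\"https://example.com/\" onclick=\"steal()\">link</a></p>"
                + "<script>alert('xss')</script>";
        System.out.println(sanitize(untrusted));

        String page = "<p><a href=\"https://example.com/\">Out</a> <a href=\"/local\">In</a></p>";
        System.out.println(markExternalLinks(page));
    }
}
```

`Safelist.basic()` is only one of the predefined lists; which one fits depends on how much markup you want to allow in user-submitted content.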
jsoup can fetch an HTML document directly from a URL and select the links in it. It can also return the absolute URL automatically, even if the href attribute references a relative one. This is possible because the proper setting, the base URI, is set implicitly when you fetch the document from a URL.
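A minimal sketch of link extraction with absolute URL resolution might look like the following. To stay runnable offline it parses an inline string and passes the base URI explicitly instead of fetching a page; with a live page, `Jsoup.connect(url).get()` would supply the base URI for you. The class name, method name, and example.com URLs are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinksSketch {

    // Returns every link target in the document as an absolute URL,
    // resolving relative hrefs against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/intro.html\">Intro</a> "
                + "<a href=\"https://example.org/x\">Elsewhere</a>";
        for (String url : absoluteLinks(html, "https://example.com/blog/")) {
            System.out.println(url);
        }
    }
}
```

`link.attr("abs:href")` is an equivalent way to ask for the resolved value; plain `link.attr("href")` returns the attribute exactly as written in the markup.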
jsoup is a super-powered tool with a lot of features, so going through the jsoup cookbook is highly recommended: https://jsoup.org/cookbook/
The installation instructions are also available at the official site: https://jsoup.org/
HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.
This explanation from the original site reveals that the project is old, but it is still updated and maintained. The disadvantage of using HtmlCleaner is that the interface is a bit dated and can be clunky when you need to manipulate HTML.
The advantage is that it works well even on old HTML documents. It can also write the documents in XML or pretty HTML (i.e., with the correct indentation). If you need JDOM and a product that supports XPath, or you even like XML, look no further.
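As a rough sketch of the XML output and XPath support mentioned above, assuming HtmlCleaner 2.x is on the classpath: the class name, method names, and sample markup below are made up for illustration and are not taken from the project's documentation.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

class HtmlCleanerSketch {

    // Cleans ill-formed HTML and serializes the result as indented XML.
    static String toXml(String html) {
        try {
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            TagNode root = cleaner.clean(html);
            return new PrettyXmlSerializer(props).getAsString(root);
        } catch (Exception e) {
            // Broad catch so the sketch compiles whether or not a given
            // release declares checked exceptions on these calls.
            throw new IllegalStateException("Could not clean HTML", e);
        }
    }

    // Evaluates an XPath expression and returns the href attribute of the
    // first matching element, or null if nothing matches.
    static String firstHref(String html, String xpath) {
        try {
            TagNode root = new HtmlCleaner().clean(html);
            for (Object match : root.evaluateXPath(xpath)) {
                if (match instanceof TagNode) {
                    return ((TagNode) match).getAttributeByName("href");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input with an unclosed <p> and no <html>/<body> wrapper.
        String messy = "<div><p>See the <a href=\"/docs\">docs</a><p>Second paragraph</div>";
        System.out.println(toXml(messy));
        System.out.println(firstHref(messy, "//a"));
    }
}
```

HtmlCleaner implements only a subset of XPath, so simple path expressions like the one used here are the safer choice.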
The documentation offers a few examples and API documentation, but nothing more. The following example comes from it (but, of course, the original URL was changed).
Visit the original site to dive deeper into HtmlCleaner and get installation instructions: http://htmlcleaner.sourceforge.net/
While there might not always be that many choices, luckily there are always at least one or two good choices available for every language.
Also, don't forget to try out our web scraping API, which opens web pages in a real browser. Real-world HTML documents may be malformed according to the standard, but they still work in the browser, so you'll always be able to execute JS on top of the HTML tree.