HTML Parsing Libraries: Java

How to parse HTML with popular Java libraries: HTMLCleaner and Jsoup

HTML is a simple structured markup language, and anyone who writes a web scraper has to deal with HTML parsing.

HTML is so popular that you rarely need to write a parser yourself: there is a better option, using a library. A library is easier to use and usually provides more features, such as a way to create an HTML document or easy navigation through the parsed document. It usually also comes with CSS/jQuery-like selectors to find nodes according to their position in the hierarchy.

The goal of this article is to help you find the right tool for HTML processing in Java.

Jsoup

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

https://jsoup.org/ – Official jsoup site

jsoup is a Java-based library with a long history, but a modern attitude:

  • it can handle old and bad HTML, but it is also equipped for HTML5
  • it has powerful support for manipulation, with support for CSS selectors, DOM Traversal, and easy addition or removal of HTML
  • it can clean HTML, both to protect against XSS attacks and in the sense that it improves structure and formatting

Its selector and traversal methods are jQuery-like. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
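As a small illustration of the lenient parsing and DOM manipulation described above, here is a minimal sketch. The class name, the helper paragraphTexts, and the sample markup are made up for this example, and it assumes jsoup is on the classpath.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {

    // Parses possibly malformed HTML and returns the text of each <p> element.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical input: neither <p> is closed, and there is no <html> or <body>.
        String messy = "<title>Demo</title><p>First<p>Second";

        Document doc = Jsoup.parse(messy);
        System.out.println(doc.title());           // Demo
        System.out.println(paragraphTexts(messy)); // [First, Second]

        // DOM manipulation: add a class to the first paragraph and replace its text.
        doc.select("p").first().addClass("lead").text("Intro");
        System.out.println(doc.body().html());
    }
}
```

The parser closes the first paragraph when it meets the second one, much as a browser would, so both paragraphs end up as siblings in the resulting tree.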

The jsoup library provides the following functionality:

  • Multiple Read Support − It reads and parses HTML using URL, file, or string.
  • CSS Selectors − It can find and extract data, using DOM traversal or CSS selectors.
  • DOM Manipulation − It can manipulate the HTML elements, attributes, and text.
  • Prevent XSS attacks − It can clean user-submitted content against a given safelist (called a whitelist in older releases), to prevent XSS attacks.
  • Tidy − It outputs tidy HTML.
  • Handles invalid data − jsoup can handle unclosed tags and implicit tags, and can reliably create the document structure.

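In the following example, jsoup fetches an HTML document directly from a URL and selects its links. Inside the loop you can also see how to get the absolute URL automatically, even if the href attribute references a relative one, by using the abs: prefix. This works because the base URI is set implicitly when you fetch the URL with the connect method.


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// fetch and parse the page; connect() also records the base URI
Document doc = Jsoup.connect("https://scrapingant.com/")
               .userAgent("Mozilla")
               .get();

// select every anchor element that has an href attribute
Elements links = doc.select("a[href]");

System.out.printf("%nLinks: (%d)%n", links.size());
for (Element link : links) {
   // "abs:href" resolves a relative href against the base URI
   System.out.printf(" * a: <%s>  (%s)%n", link.attr("abs:href"), link.text());
}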
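The example above gets its base URI implicitly from connect. When the HTML comes from a string instead, the base URI can be passed to Jsoup.parse explicitly. The following is a minimal sketch of that case; the class name, the helper absoluteLinks, and the sample markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {

    // Returns every link in the HTML as an absolute URL, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl("href") is the method form of attr("abs:href")
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute link.
        String html = "<a href=\"/docs\">Docs</a> <a href=\"https://example.org/x\">X</a>";
        for (String url : absoluteLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```

If no base URI is supplied, jsoup cannot resolve relative links, and absUrl returns an empty string for them.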

jsoup is a powerful tool with a lot of features, so it is highly recommended to go through the jsoup cookbook: https://jsoup.org/cookbook/

The installation instructions are also available at the official site: https://jsoup.org/
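One of the features listed earlier, cleaning user-submitted HTML to prevent XSS, can be sketched in a few lines. This assumes a jsoup release that ships the Safelist class (older releases named it Whitelist); the class name, the helper sanitize, and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {

    // Keeps only the tags and attributes allowed by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user input with a script element and an inline event handler.
        String dirty = "<p>Hello <script>alert(1)</script>"
                + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        System.out.println(sanitize(dirty));
    }
}
```

The script element and the onclick attribute are dropped, while the paragraph and the link itself survive. Other predefined safelists, such as Safelist.none() or Safelist.relaxed(), trade strictness for richer markup.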

HTMLCleaner

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

http://htmlcleaner.sourceforge.net/ – Official HTMLCleaner site

As this description from the official site suggests, the project is old, although it is still updated and maintained. The disadvantage of using HtmlCleaner is that its interface feels dated and can be clunky when you need to manipulate HTML.

The advantage is that it works well even on old HTML documents. It can also write the documents as XML or as pretty HTML (i.e., with the correct indentation). If you need JDOM output and XPath support, or you simply prefer working with XML, look no further.
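The XML output and XPath support mentioned above might look like the following minimal sketch. It assumes a 2.x release of HtmlCleaner on the classpath; the class name, the helper countMatches, and the sample markup are made up for illustration, and exact signatures may differ between releases.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

public class HtmlCleanerXPathSketch {

    // Cleans the HTML and counts the nodes matched by an XPath expression.
    static int countMatches(String html, String xpath) {
        try {
            TagNode root = new HtmlCleaner().clean(html);
            return root.evaluateXPath(xpath).length;
        } catch (Exception e) {
            // evaluateXPath declares a checked exception for invalid expressions
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: the <li> elements are never closed.
        String messy = "<ul><li>One<li>Two</ul>";

        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode root = cleaner.clean(messy);

        // HtmlCleaner supports a subset of XPath on the cleaned tree.
        for (Object match : root.evaluateXPath("//li")) {
            System.out.println(((TagNode) match).getText());
        }

        // Serialize the cleaned tree as indented, well-formed XML.
        System.out.println(new PrettyXmlSerializer(props).getAsString(root));
    }
}
```

Because the cleaned tree is well-formed, the serialized output can be handed to any standard XML tool afterwards.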

The documentation offers a few examples and an API reference, but nothing more. The following example comes from it (but, of course, the original URL was changed 😀).


import java.net.URL;
import org.htmlcleaner.*;

HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "https://scrapingant.com/";
 
TagNode node = cleaner.clean(new URL(siteUrl));
 
// traverse whole DOM and update images to absolute URLs
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof TagNode) {
            TagNode tag = (TagNode) htmlNode;
            String tagName = tag.getName();
            if ("img".equals(tagName)) {
                String src = tag.getAttributeByName("src");
                if (src != null) {
                    tag.setAttribute("src", Utils.fullUrl(siteUrl, src));
                }
            }
        } else if (htmlNode instanceof CommentNode) {
            CommentNode comment = ((CommentNode) htmlNode); 
            comment.getContent().append(" -- By HtmlCleaner");
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});
 
SimpleHtmlSerializer serializer = 
    new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/scrapingant.html");
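The traversal above relies on Utils.fullUrl to turn a relative src into an absolute URL. The same idea can be sketched with the JDK alone, assuming java.net.URI resolution is an acceptable stand-in. This is not HtmlCleaner's implementation, and the class name, the helper toAbsolute, and the sample paths are hypothetical.

```java
import java.net.URI;

public class AbsoluteUrlSketch {

    // Resolves a possibly relative link against the URL of the page it appears on.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "https://scrapingant.com/blog/post";

        // Root-relative path: replaces the whole path of the page URL.
        System.out.println(toAbsolute(page, "/img/logo.png"));
        // prints https://scrapingant.com/img/logo.png

        // Document-relative path: resolved against the page's directory.
        System.out.println(toAbsolute(page, "img/chart.png"));
        // prints https://scrapingant.com/blog/img/chart.png

        // Already absolute: returned unchanged.
        System.out.println(toAbsolute(page, "https://example.com/a.png"));
        // prints https://example.com/a.png
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, ones containing spaces), so real-world attribute values may need encoding or a try/catch first.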

Visit the official site to dive deeper into HtmlCleaner and get installation instructions: http://htmlcleaner.sourceforge.net/

Conclusion

While there might not always be many choices, luckily at least one or two good libraries are available for every language.

Also, don’t forget to try out our scraping solution, which opens web pages in a real browser. Real-world HTML documents may be malformed according to the standard, but they still work in the browser, so you will always be able to execute JS on top of the HTML tree.
