9.7.10

Open Source HTML Parsers in Java

TagSoup

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
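Because the parser is exposed through SAX, ordinary SAX handler code can consume its events. The following is a minimal sketch, not TagSoup's own sample code: the class name LinkCollector and the helper collect are hypothetical, and the demo uses the JDK's built-in XML parser on well-formed input so that it runs without extra libraries. Assuming TagSoup is on the classpath, its org.ccil.cowan.tagsoup.Parser class (an XMLReader) could be passed in instead to handle messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: collects the href of every <a> element from SAX events. */
public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Non-namespace-aware parsers may leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Runs any SAX XMLReader over the markup and returns the collected links. */
    public static List<String> collect(XMLReader reader, String markup) throws Exception {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; it only accepts well-formed XML.
        // Assumption: with TagSoup available, use
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(collect(reader, xhtml));
    }
}
```

The handler depends only on the org.xml.sax interfaces, so swapping the reader does not require changing it.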

HTML Parser

A fast real-time parser for real-world HTML.

Java HTML Parser

HTML Parser that produces a stream of tag objects, which can be further parsed into a searchable tree structure.

Cobra 0.95.1

Cobra is an HTML toolkit that contains a pure Java HTML DOM parser and a rendering engine. It supports HTML 4, JavaScript and CSS2.

HtmlCleaner

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

Java Mozilla HTML Parser

MozillaParser is a Java HTML parser based on Mozilla's HTML parser. It acts as a bridge from Java classes to Mozilla's classes and outputs a Java Document object from raw (and dirty) HTML input.

HotSAX

HotSAX is a fast, small footprint, non-validating SAX2 parser for HTML/XML/XHTML. It can be used in simple web agents, page scrapers, and spiders. It is similar to the Apache Xerces parser, except that it can generate SAX events for badly formatted HTML as well.

NekoHTML

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

Jericho HTML Parser

A simple but powerful Java library for parsing and modifying HTML documents, including analysis of arbitrary HTML forms to determine the structure of submitted data.

JTidy

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
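Once a parser hands back an org.w3c.dom.Document, the usual DOM calls apply. The sketch below is illustrative only: the class HeadingExtractor and its methods are hypothetical names, and the demo builds its Document with the JDK's DocumentBuilder from well-formed markup so that it runs on its own. Assuming JTidy is on the classpath, the Document could instead come from new org.w3c.tidy.Tidy().parseDOM(inputStream, null). The traversal sticks to basic DOM calls on the assumption that a parser's DOM implementation may not support newer convenience methods.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical example: extracts the text of all elements with a given tag name. */
public class HeadingExtractor {

    /** Returns the trimmed text of every element named tagName, in document order. */
    public static List<String> textOf(Document doc, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            StringBuilder sb = new StringBuilder();
            appendText(nodes.item(i), sb);
            result.add(sb.toString().trim());
        }
        return result;
    }

    /** Recursively appends the values of all descendant text nodes. */
    private static void appendText(Node node, StringBuilder sb) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else {
                appendText(child, sb);
            }
        }
    }

    /** Stand-in for an HTML parser: the JDK builder only accepts well-formed XML. */
    public static Document parseWellFormed(String markup) throws Exception {
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWellFormed(
                "<html><body><h1>First <em>big</em> title</h1><h1>Second</h1></body></html>");
        for (String heading : textOf(doc, "h1")) {
            System.out.println(heading);
        }
    }
}
```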

VietSpider HTMLParser

VietSpider HTMLParser is a pure Java HTML DOM parser that supports HTML 4.01. It is fast, checks syntax, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
