html5lib is a library for Ruby and Python that parses HTML documents. It supports HTML 5 and aims for maximum compatibility with desktop browsers.
Key features include:
Parses valid and invalid HTML documents to a tree
Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup and custom simpletree output formats
DOM to SAX converter
Reports parse errors
Character encoding detection
XML mode for working with ill-formed XML, e.g. feeds
Filtering and serializing of trees
HTML+CSS sanitizer
Many unit tests
Faster than before
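
The first few features above (parsing invalid markup to a tree, ElementTree output, and parse-error reporting) can be sketched in Python roughly as follows. This is a minimal illustration, not an official example: it assumes the third-party html5lib package is installed, and the helper names parse_to_etree and collect_parse_errors, as well as the sample markup, are made up for this sketch.

```python
import html5lib  # third-party package, assumed to be installed


def parse_to_etree(markup):
    """Parse possibly invalid HTML and return the <html> element as an ElementTree node.

    namespaceHTMLElements=False keeps tag names plain ("p" rather than
    "{http://www.w3.org/1999/xhtml}p"), which makes path lookups shorter.
    """
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


def collect_parse_errors(markup):
    """Return the parse errors html5lib recorded while parsing the markup."""
    parser = html5lib.HTMLParser(
        tree=html5lib.getTreeBuilder("etree"),
        namespaceHTMLElements=False,
    )
    parser.parse(markup)
    # Each entry describes one problem found in the input (position, error code, details).
    return parser.errors


if __name__ == "__main__":
    broken = "<p>Hello<b>world"  # hypothetical input: no doctype, unclosed tags
    root = parse_to_etree(broken)
    print(root.tag)
    print(root.find("body/p/b").text)
    print(len(collect_parse_errors(broken)), "parse errors reported")
```

Even though the input omits the doctype, html, head and body, the parser builds a complete tree in the way a browser would, and the problems it noticed remain available on the parser object afterwards.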
Project home page: http://code.google.com/p/html5lib/
Downloads: http://code.google.com/p/html5lib/downloads/list
Source: OSChina (开源中国社区)