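Apache Nutch 1.14 has been released. Nutch is a mature, production-ready web crawler. Nutch 1.x supports fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.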
Changes:
Bug fixes
A parser failure on a single document may fail the crawling job
Classpath discrepancy with protocol-selenium in deploy mode
Clean not working after crawl
Nutch master docker container broken
CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
Library conflict with Parser-Tika Plugin and Lib Folder
Improvements
Improving comments on the Injector Class
CrawlDB filtered documents counter.
Regex filter using case sensitive rules.
The crawl script should be able to skip an initial injection.
Ant Eclipse build does not include protocol-interactiveselenium
Upgrade feed parser plugin to use rome 1.5
Release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218
Download: http://nutch.apache.org/downloads.html
Source: OSChina (开源中国社区)

