Apache Nutch 1.14 Released: Web Crawler

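Published: 2017-12-27 08:59:31 | Source: 红联 | Author: baihuo

Apache Nutch 1.14 has been released. Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.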

Changes:

Bug fixes

A parser failure on a single document may fail crawling job

Classpath discrepancy with protocol-selenium in deploy mode

Clean not working after crawl

Nutch master docker container broken

CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

Library conflict with Parser-Tika Plugin and Lib Folder

Improvements

Improving comments on the Injector Class

CrawlDB filtered documents counter.

Regex filter using case sensitive rules.

The crawl script should be able to skip an initial injection.

Ant Eclipse build does not include protocol-interactiveselenium

Upgrade feed parser plugin to use rome 1.5

Release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218

Download: http://nutch.apache.org/downloads.html

From: 开源中国社区 (OSChina)