Apache Nutch 1.13 Released: Web Crawler

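Published: 2017-04-04 00:02:11 | Source: 红联 | Author: lovsher
The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13. All current users and developers of the 1.X series are advised to upgrade to this version.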

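Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.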

Changes:

Sub-task

Refactor /seed endpoint for backward compatibility

Bug

Property 'indexer.delete.robots.noindex' not working when using parser-html.

lastModified not always set

Fix mrunit dependencies

urlnormalizer-basic to strip empty port

FetchItemQueue logs are logged with wrong class name

urlnormalizer-basic NPE for ill-formed URL "http:/"

Index metadata throw Exception because writable object cannot be cast to Text

Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

http.agent.rotate: IllegalArgumentException / last element of agent names ignored

Deprecated Job constructor in hostdb/ReadHostDb.java
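
To make the "urlnormalizer-basic to strip empty port" entry above concrete: a URL such as `http://example.com:/index.html` carries a port separator with no port number, and normalization is expected to drop the dangling colon. The following is only a rough Python sketch of that idea, not Nutch's Java implementation; the function name `strip_empty_port` and the regular expression are assumptions made up for illustration.

```python
import re

# Hypothetical helper, for illustration only (not part of Nutch).
# Matches "scheme://authority" followed by a ':' that has no digits
# after it, i.e. the ':' sits right before the path, query, fragment,
# or the end of the string.
_EMPTY_PORT = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*://[^/?#]*?):(?=[/?#]|$)")


def strip_empty_port(url: str) -> str:
    """Return the URL with a dangling ':' (empty port) removed, if present."""
    return _EMPTY_PORT.sub(r"\1", url, count=1)


if __name__ == "__main__":
    print(strip_empty_port("http://example.com:/index.html"))
    # prints: http://example.com/index.html
    print(strip_empty_port("http://example.com:8080/index.html"))
    # prints: http://example.com:8080/index.html (explicit port is kept)
```

URLs with an explicit port pass through unchanged; only the empty-port form is rewritten.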

Improvement

Add main() to ZipParser

Inconsistent 'Modified Time' in crawl db

Upgrade to elasticsearch 2.3.3

Upgrade to Hadoop 2.7.2

Utilize parameterized logging notation across Fetcher

Index checker server to optionally keep client connection open

CrawlDbReader -stats to show fetch time and interval

Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy

Remove obsolete properties protocol.plugin.check.*

Fetcher to optionally save robots.txt

Seeds injected in REST workflow must be ingested into HDFS

Update Slf4j logging for Java 8 and upgrade miredot plugin version

SegmentReader to implement Tool

Log with Generic Class Name at Nutch 1.x

Protocol plugins to set cookie if Cookie metadata field is present

Get single record from HostDB

New Feature

Publisher/Subscriber model for Nutch to emit events

Task

Upgrade Nutch Trunk to Java 1.8

Download:

http://nutch.apache.org/downloads.html

From: 开源中国社区 (OSChina community)