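The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13 and recommends that all current users and developers of the 1.X series upgrade to this version.
Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.
Changes: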
Sub-task
Refactor /seed endpoint for backward compatibility
Bug
Property 'indexer.delete.robots.noindex' not working when using parser-html.
lastModified not always set
Fix mrunit dependencies
urlnormalizer-basic to strip empty port
FetchItemQueue logs are logged with wrong class name
urlnormalizer-basic NPE for ill-formed URL "http:/"
Index metadata throw Exception because writable object cannot be cast to Text
Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
http.agent.rotate: IllegalArgumentException / last element of agent names ignored
Deprecated Job constructor in hostdb/ReadHostDb.java
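
To illustrate the "urlnormalizer-basic to strip empty port" entry above, here is a minimal sketch of what stripping an empty port can mean: a URL such as http://example.com:/index.html has a colon after the host but no port number, and normalization drops that dangling colon. The class name EmptyPortSketch and the method stripEmptyPort are hypothetical names chosen for this example. This is not the code from the Nutch plugin, only a plain string-based approximation of the idea.

```java
public class EmptyPortSketch {

    // Hypothetical helper (not part of Nutch): removes a trailing ':' from the
    // authority part of a URL when no port number follows it.
    static String stripEmptyPort(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // no authority section to inspect, e.g. "http:/"
        }
        int authStart = schemeEnd + 3;
        int authEnd = url.length();
        // The authority ends at the first '/', '?' or '#', or at end of string.
        for (int i = authStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                authEnd = i;
                break;
            }
        }
        // A colon directly before the end of the authority is an empty port.
        if (authEnd > authStart && url.charAt(authEnd - 1) == ':') {
            return url.substring(0, authEnd - 1) + url.substring(authEnd);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(stripEmptyPort("http://example.com:/index.html"));
        System.out.println(stripEmptyPort("http://example.com:8080/index.html"));
    }
}
```

URLs that carry an explicit port number, or that have no "://" separator at all, pass through this sketch unchanged.
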
Improvement
Add main() to ZipParser
Inconsistent 'Modified Time' in crawl db
Upgrade to elasticsearch 2.3.3
Upgrade to Hadoop 2.7.2
Utilize parameterized logging notation across Fetcher
Index checker server to optionally keep client connection open
CrawlDbReader -stats to show fetch time and interval
Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
Remove obsolete properties protocol.plugin.check.*
Fetcher to optionally save robots.txt
Seeds injected in REST workflow must be ingested into HDFS
Update Slf4j logging for Java 8 and upgrade miredot plugin version
SegmentReader to implement Tool
Log with Generic Class Name at Nutch 1.x
Protocol plugins to set cookie if Cookie metadata field is present
Get single record from HostDB
New Feature
Publisher/Subscriber model for Nutch to emit events
Task
Upgrade Nutch Trunk to Java 1.8
Download:
http://nutch.apache.org/downloads.html
Source: OSChina community (开源中国社区)

