Apache Tika 1.7 Released: A Text Content Extraction Toolkit

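Published: 2015-01-17 09:32:23 Source: 红联 Author: empast
Apache Tika 1.7 has been released. Tika is a toolkit for extracting text content. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.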

This release includes many improvements and bug fixes. The detailed list follows:
* Fixed resource leak in OutlookPSTParser that caused TikaException
when invoked via AutoDetectParser on Windows (TIKA-1506).

* HTML tags are properly stripped from content by FeedParser
(TIKA-1500).

* Tika Server support for selecting a single metadata key;
wrapped MetadataEP into MetadataResource (TIKA-1499).

* Tika Server support for JSON and XMP views of metadata (TIKA-1497).

* Tika Parent uses dependency management to keep duplicate
dependencies in different modules the same version (TIKA-1384).

* Upgraded slf4j to version 1.7.7 (TIKA-1496).

* Tika Server support for RecursiveParserWrapper's JSON output
(endpoint=rmeta) equivalent to (TIKA-1451's) -J option
in tika-app (TIKA-1498).

* Tika Server support for providing the password for files on a
per-request basis through the Password HTTP header (TIKA-1494).

* Simple support for the BPG (Better Portable Graphics) image format
(TIKA-1491, TIKA-1495).

* Prevent exceptions from being thrown for some malformed
mp3 files (TIKA-1218).

* Reformat pom.xml files to use two spaces per indent (TIKA-1475).

* Fix warning of slf4j logger on Tika Server startup (TIKA-1472).

* Tika CLI and GUI now have option to view JSON rendering of output
of RecursiveParserWrapper (TIKA-1451).

* Tika now integrates the Geospatial Data Abstraction Library
(GDAL) for parsing hundreds of geospatial formats (TIKA-605,
TIKA-1503).

* ExternalParsers can now use regexes to specify dynamic keys
(TIKA-1441).

* Thread safety issues in ImageMetadataExtractor were resolved
(TIKA-1369).

* The ForkParser service is now registered in Activator
(TIKA-1354).

* The Rome Library was upgraded to version 1.5 (TIKA-1435).

* Add markup for files embedded in PDFs (TIKA-1427).

* Extract files embedded in annotations in PDFs (TIKA-1433).

* Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).

* Add RecursiveParserWrapper (aka Jukka's and Nick's
RecursiveMetadataParser) (TIKA-1329).

* Add example for how to dump TikaConfig to XML (TIKA-1418).

* Allow users to specify a tika config file for tika-app (TIKA-1426).

* PackageParser includes the last-modified date from the archive
in the metadata when handling embedded entries (TIKA-1246).

* Created a new Tesseract OCR Parser to extract text from images.
Requires installation of Tesseract before use (TIKA-93).

* Basic parser for older Excel formats, such as Excel 4, 5 and 95,
which extracts simple text, plus metadata for Excel 5 and 95
(TIKA-1490).
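
Several of the Tika Server items above (the rmeta endpoint from TIKA-1498 and the per-request Password header from TIKA-1494) can be combined in a single request. The following is a minimal sketch, not taken from the release notes: it assumes Java 11 or later, a server reachable at http://localhost:9998, that documents are uploaded with PUT, and that JSON is requested through an Accept header. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that builds requests for a Tika Server instance. */
final class TikaServerRequests {

    // Assumption: the server address; change it to match your deployment.
    static final String BASE_URL = "http://localhost:9998";

    /**
     * Builds a request for the rmeta endpoint (TIKA-1498).
     * If a password is given, it is passed in the Password header (TIKA-1494).
     * Assumption: the document is uploaded with PUT and JSON is requested via Accept.
     */
    static HttpRequest recursiveMetadata(byte[] document, String password) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
        if (password != null) {
            // Only sent for protected files; omitted otherwise.
            builder.header("Password", password);
        }
        return builder.build();
    }

    /** Sends the request and returns the response body as a string. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With a server running, something like `TikaServerRequests.send(TikaServerRequests.recursiveMetadata(bytes, "secret"))` would return the server's response body. Building the request separately from sending it keeps the header logic easy to inspect without a live server.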

Project details: http://tika.apache.org/

Download: http://tika.apache.org/download.html

From: 开源中国社区 (OSChina)