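Apache Tika 1.7 has been released. Tika is a toolkit for content extraction. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.
This release includes many improvements and bug fixes. The full list follows: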
* Fixed resource leak in OutlookPSTParser that caused TikaException
when invoked via AutoDetectParser on Windows (TIKA-1506).
* HTML tags are properly stripped from content by FeedParser
(TIKA-1500).
* Tika Server support for selecting a single metadata key;
wrapped MetadataEP into MetadataResource (TIKA-1499).
* Tika Server support for JSON and XMP views of metadata (TIKA-1497).
* Tika Parent uses dependency management to keep duplicate
dependencies in different modules the same version (TIKA-1384).
* Upgraded slf4j to version 1.7.7 (TIKA-1496).
* Tika Server support for RecursiveParserWrapper's JSON output
(endpoint=rmeta), equivalent to the -J option added to tika-app
by TIKA-1451 (TIKA-1498).
* Tika Server support for providing the password for files on a
per-request basis through the Password http header (TIKA-1494).
* Simple support for the BPG (Better Portable Graphics) image format
(TIKA-1491, TIKA-1495).
* Prevent exceptions from being thrown for some malformed
mp3 files (TIKA-1218).
* Reformat pom.xml files to use two spaces per indent (TIKA-1475).
* Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
* Tika CLI and GUI now have option to view JSON rendering of output
of RecursiveParserWrapper (TIKA-1451).
* Tika now integrates the Geospatial Data Abstraction Library
(GDAL) for parsing hundreds of geospatial formats (TIKA-605,
TIKA-1503).
* ExternalParsers can now use regular expressions to specify
dynamic keys (TIKA-1441).
* Thread safety issues in ImageMetadataExtractor were resolved
(TIKA-1369).
* The ForkParser service is now registered in Activator
(TIKA-1354).
* The Rome Library was upgraded to version 1.5 (TIKA-1435).
* Add markup for files embedded in PDFs (TIKA-1427).
* Extract files embedded in annotations in PDFs (TIKA-1433).
* Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
* Add RecursiveParserWrapper (aka Jukka's and Nick's
RecursiveMetadataParser) (TIKA-1329).
* Add example for how to dump TikaConfig to XML (TIKA-1418).
* Allow users to specify a tika config file for tika-app (TIKA-1426).
* PackageParser includes the last-modified date from the archive
in the metadata when handling embedded entries (TIKA-1246).
* Created a new Tesseract OCR Parser to extract text from images.
Requires installation of Tesseract before use (TIKA-93).
* Basic parser for older Excel formats, such as Excel 4, 5 and 95,
which extracts simple text, plus metadata for Excel 5 and 95 (TIKA-1490).
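
The Tika Server items in the list (the rmeta endpoint, the JSON view of metadata, and the per-request Password header) can be pictured as plain HTTP requests. What follows is a minimal Python sketch and is not taken from the release notes. It only builds the requests and never sends them. The server address http://localhost:9998, the /rmeta and /meta paths, the PUT method, and the Accept value are assumptions based on typical Tika Server usage; build_request and the sample bytes are hypothetical names used for illustration.

```python
import urllib.request

# Assumption: a Tika Server instance is reachable at this address.
TIKA_SERVER = "http://localhost:9998"


def build_request(endpoint, payload, accept=None, password=None):
    """Build (but do not send) a PUT request for a Tika Server endpoint.

    build_request is a hypothetical helper, not part of Tika.
    """
    headers = {}
    if accept is not None:
        # Ask the server for a specific rendering, e.g. JSON metadata.
        headers["Accept"] = accept
    if password is not None:
        # Per-request password for protected files (TIKA-1494).
        headers["Password"] = password
    return urllib.request.Request(
        TIKA_SERVER + "/" + endpoint.lstrip("/"),
        data=payload,
        headers=headers,
        method="PUT",
    )


# Placeholder bytes standing in for the contents of a real document.
sample = b"placeholder document bytes"

# Recursive JSON output for a password-protected file (TIKA-1498, TIKA-1494).
rmeta_request = build_request("rmeta", sample, password="secret")

# JSON view of the metadata (TIKA-1497); the Accept value is an assumption.
json_meta_request = build_request("meta", sample, accept="application/json")
```

To try this against a running server, pass either request to urllib.request.urlopen and read the response body. The requests are built separately from sending so that the headers and URL can be inspected first.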
Project details: http://tika.apache.org/
Download: http://tika.apache.org/download.html
Source: OSChina (开源中国社区)

