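Today I set out to add a PDF-processing feature to the Nutch source code, and one of the steps is extracting the text from PDF documents. After some thought I decided to use PDFBox. The parse-tika plugin in the Nutch source already ships a PDFBox, but it is version 1.1.0 and fails on many PDF documents. The latest version on the official site is now 1.6.0, so I decided to swap it in. Since I don't like reading English documentation, this took more trouble than it should have.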
At first I downloaded only pdfbox-1.6.0.jar and replaced the old jar with it, and the program threw errors. With no other choice, I read the official documentation carefully. The Dependencies section of the PDFBox site (http://pdfbox.apache.org/) states clearly which components PDFBox needs and how they relate. PDFBox has three main components: besides pdfbox-1.6.0.jar above, there are fontbox-1.6.0.jar and jempbox-1.6.0.jar. It also needs the commons-logging component for logging. Nutch already includes the logging component, as commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. If you use PDFBox in your own application, you need all five of these jars (the logging component accounts for two).
Of course, for convenience the official site also provides a single bundled jar, pdfbox-app-1.6.0.jar. If you use that jar, you don't need any of the others.
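Since my first attempt failed only because some jars were missing, a quick classpath check can save time. The following is a minimal sketch, not part of PDFBox or Nutch: the class PdfBoxClasspathCheck and its methods are names I made up, and the representative class picked for each jar assumes the PDFBox 1.x package layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PdfBoxClasspathCheck {

    // One representative class per required component.
    // Assumption: these are the PDFBox 1.x package names.
    static Map<String, String> requiredClasses() {
        Map<String, String> required = new LinkedHashMap<String, String>();
        required.put("pdfbox", "org.apache.pdfbox.pdmodel.PDDocument");
        required.put("fontbox", "org.apache.fontbox.cmap.CMap");
        required.put("jempbox", "org.apache.jempbox.xmp.XMPMetadata");
        required.put("commons-logging", "org.apache.commons.logging.LogFactory");
        return required;
    }

    // Returns the names of the components whose class cannot be loaded.
    static List<String> missingComponents(Map<String, String> required) {
        ClassLoader loader = PdfBoxClasspathCheck.class.getClassLoader();
        List<String> missing = new ArrayList<String>();
        for (Map.Entry<String, String> entry : required.entrySet()) {
            try {
                // initialize=false: only locate the class, do not run static initializers.
                Class.forName(entry.getValue(), false, loader);
            } catch (ClassNotFoundException e) {
                missing.add(entry.getKey());
            } catch (LinkageError e) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingComponents(requiredClasses());
        if (missing.isEmpty()) {
            System.out.println("All PDFBox components found on the classpath.");
        } else {
            System.out.println("Missing components: " + missing);
        }
    }
}
```

Run it with the same classpath as your application. With only pdfbox-1.6.0.jar present, it should list fontbox and jempbox as missing.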
OK, with everything in place, we can start extracting text. The code for this is fairly simple, and there are plenty of examples online. For example:
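import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
PDDocument doc = PDDocument.load("D:/331.pdf");
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// PDFTextStripper has no getTitle(); the title comes from the document information.
String title = doc.getDocumentInformation().getTitle();
doc.close();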
That reads a PDF file from local disk. If the file comes from the network, you will first get an InputStream for it (assume it is named stream), and the code looks like this:
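import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
// Parse the stream directly; no empty PDDocument needs to be created first.
PDFParser parser = new PDFParser(stream);
parser.parse();
PDDocument doc = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// PDFTextStripper has no getTitle(); the title comes from the document information.
String title = doc.getDocumentInformation().getTitle();
doc.close();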
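When the stream comes from the network, it may not contain a PDF at all (an HTML error page, for instance). Below is a minimal sketch of a pre-check that peeks at the first bytes for the "%PDF-" signature before anything is handed to the parser. The class PdfHeaderCheck and the method looksLikePdf are hypothetical names of mine, not PDFBox API. The check is deliberately strict: some real files carry a few junk bytes before the header, so treat a negative result as a hint, not a verdict.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class PdfHeaderCheck {

    // Every PDF file is expected to begin with these five ASCII bytes.
    private static final byte[] MAGIC = {'%', 'P', 'D', 'F', '-'};

    // Peeks at the start of the stream and rewinds it, so the caller
    // can still pass the same stream on to a parser afterwards.
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(MAGIC.length);
        try {
            for (byte expected : MAGIC) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfBytes = "%PDF-1.4\n".getBytes("US-ASCII");
        byte[] htmlBytes = "<html>404</html>".getBytes("US-ASCII");
        System.out.println(looksLikePdf(
                new BufferedInputStream(new ByteArrayInputStream(pdfBytes))));
        System.out.println(looksLikePdf(
                new BufferedInputStream(new ByteArrayInputStream(htmlBytes))));
    }
}
```

In practice you would wrap the network stream in a BufferedInputStream, call looksLikePdf on it, and pass the same wrapped stream to PDFParser only when the check succeeds.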
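A few caveats:
(1) PDFBox cannot extract text from PDF files in certain formats, though it handles most of them.
(2) Beyond the body text, extra information such as the title or abstract can also be requested, but do not expect too much here: only well-structured PDF documents (academic papers and the like) yield usable values. For the rest, the result is either null or wrong.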
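Because the title is so often null, it helps to have a fallback. Here is a minimal sketch under my own assumptions: the helper chooseTitle is hypothetical, and using the first non-blank line of the extracted text as a substitute title is just a heuristic. It cannot detect a title that is present but wrong.

```java
public class TitleFallback {

    // Hypothetical helper: prefer the metadata title when it is non-blank,
    // otherwise fall back to the first non-blank line of the extracted text.
    // Returns null when neither source provides anything.
    static String chooseTitle(String metadataTitle, String extractedText) {
        if (metadataTitle != null && metadataTitle.trim().length() > 0) {
            return metadataTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (trimmed.length() > 0) {
                    return trimmed;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Metadata title missing: the first non-blank text line is used.
        System.out.println(chooseTitle(null, "\n  A Study of Crawlers \nAbstract"));
        // Metadata title present: it wins over the text.
        System.out.println(chooseTitle("Real Title", "body text"));
    }
}
```

You would call it with the title and text variables from the extraction code above.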
PDFBox has many other features as well, such as attempting to decode documents. If you need them, go study the API...
Source: Linux Community (Linux社区)

