
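Nutch-Htmlunit 1.8 Released

Published: 2014-08-08 09:32:52 Source: Honglian (红联) Author: empast
An extension plugin built on Apache Nutch and Htmlunit for crawling, fetching, and parsing AJAX pages.

An earlier release simply placed the source code in the repository in plugin form. Many users then reported running into assorted problems when they integrated it into Apache Nutch themselves and tried to compile or run it. This release is therefore built directly on the Apache Nutch 1.8 source project, with all plugin sources, dependencies, and runtime parameters preconfigured, so that the plugin is simpler to use and works out of the box.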

http://www.oschina.net/p/nutch-htmlunit

http://git.oschina.net/xautlx/nutch-htmlunit

https://github.com/xautlx/nutch-htmlunit

Nutch Htmlunit Plugin

Project Overview

Built on Apache Nutch 1.8 and the Htmlunit component, the plugin fetches and parses the complete content of pages that load their data through AJAX.

As implemented, Apache Nutch 1.8 cannot capture dynamic HTML content from fetched pages that rely on AJAX requests, because it ignores all AJAX requests.

This plugin uses Htmlunit to fetch the whole page content, including the necessary dynamic AJAX requests. It was developed and tested with Apache Nutch 1.8; you can try it on other Nutch versions or refactor the source code to suit your own design.

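Main Features

Regular HTML page fetching: ordinary pages without AJAX features, such as news pages, can be fetched directly with the protocol-http plugin that ships with Nutch.

Regular AJAX page fetching: the vast majority of pages that load content through mechanisms such as jQuery ajax can be fetched directly with the protocol-htmlunit plugin.

Special AJAX request page fetching: pages such as those on Taobao/Tmall use the distinctive Kissy JavaScript component, so Htmlunit cannot directly detect that it needs to wait for the requests issued by Kissy to complete. Data on such pages is fetched by waiting for the page to load and checking the parsed content to decide whether loading has finished.

Scroll-triggered AJAX request page fetching: product detail pages such as those on Taobao/Tmall load the product description only when the page is scrolled. Extended handling in protocol-htmlunit makes it possible to fetch the data on such pages.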
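The "wait, then check the parsed content" approach described for Kissy-style pages can be sketched as a small polling loop. This is only an illustration under stated assumptions, not the plugin's actual code: the class ContentWaiter, the method waitForContent, and all of its parameters are hypothetical names, and the page is simulated with a plain Supplier. In a real fetcher the supplier might wrap something like Htmlunit's HtmlPage.asXml().

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper, not part of nutch-htmlunit: polls a page snapshot
// until a caller-supplied check says the AJAX content has arrived.
class ContentWaiter {

    /**
     * Returns the first snapshot accepted by isLoaded, or null if no snapshot
     * is accepted within maxAttempts polls (or the thread is interrupted).
     */
    static String waitForContent(Supplier<String> pageSource,
                                 Predicate<String> isLoaded,
                                 int maxAttempts,
                                 long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String html = pageSource.get();
            if (html != null && isLoaded.test(html)) {
                return html;
            }
            try {
                Thread.sleep(sleepMillis); // give background requests time to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated page: the price only appears on the third snapshot.
        int[] calls = {0};
        Supplier<String> fakePage = () -> ++calls[0] < 3
                ? "<div id='price'></div>"
                : "<div id='price'>99.00</div>";

        String html = waitForContent(fakePage, s -> s.contains("99.00"), 10, 10L);
        System.out.println(html != null
                ? "loaded after " + calls[0] + " polls"
                : "timed out");
    }
}
```

Judging completeness from the content rather than from pending requests is what makes this usable when the JavaScript engine cannot tell that more requests are still to come; the cost is that each page type needs its own check.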

Trying It Out

Nutch runs in a Unix/Linux environment, so please prepare a Unix/Linux system or a Cygwin environment yourself.

After running git clone on the whole project, enter the local checkout directory and start a crawl:

cd nutch-htmlunit/runtime/local

bin/crawl urls crawl false 1

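In this command, urls is the directory containing the crawler's seed URL files; crawl is the crawler output directory; false takes the place of the Solr index URL parameter, and setting it to false here skips Solr indexing; 1 is the number of crawl rounds.

When the run finishes, you can see that all the information on the Tmall product page, including the price, the description, and the images loaded on scroll, has been retrieved in full.

For a sample of the run log, see: http://git.oschina.net/xautlx/nutch-htmlunit/wikis/Log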

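Extension Plugins

protocol-htmlunit: an AJAX page Fetcher plugin implemented with Htmlunit.

parse-s2jh: parses the content of page elements with XPath; outputs the parsed data as structured records in database mode; provides custom callback logic for deciding when loading has completed on individual complex AJAX pages.

index-s2jh: adds extra attribute data that needs to be passed to the Solr index; defines rules for pages that should not be indexed.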
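To illustrate the kind of XPath-based field extraction attributed to parse-s2jh above, here is a minimal sketch that uses only the JDK's javax.xml.xpath API. It is not taken from the plugin: the class XPathFieldExtractor, the method extract, and the sample markup are hypothetical, and the sketch assumes well-formed XHTML, whereas real-world HTML would first need a tolerant parser.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of nutch-htmlunit: evaluates one XPath
// expression against well-formed markup and returns the matched text.
class XPathFieldExtractor {

    /** Returns the trimmed string value of the first match, or "" if nothing matches. */
    static String extract(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(xhtml));
            return xpath.evaluate(expression, source).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "Bad XPath or malformed markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page standing in for a fully loaded product page.
        String page = "<html><body>"
                + "<h1 class='title'>Sample item</h1>"
                + "<span id='price'>99.00</span>"
                + "</body></html>";

        System.out.println("title = " + extract(page, "//h1[@class='title']"));
        System.out.println("price = " + extract(page, "//span[@id='price']"));
    }
}
```

Keeping each field as a separate expression makes it straightforward to map one XPath to one column or index attribute when the parsed values are stored or passed on for indexing.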

Èí¼þÏêÇ飺https://github.com/xautlx/nutch-htmlunit

ÏÂÔØµØÖ·£ºhttp://git.oschina.net/xautlx/nutch-htmlunit

From: OSChina community (开源中国社区)