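An AJAX page crawling and parsing plugin built as an extension of Apache Nutch and Htmlunit
An earlier release simply placed the plugin source code in the repository. Quite a few people then reported assorted problems when compiling or running it after integrating it into Apache Nutch themselves. This release is therefore based on the complete Apache Nutch 1.8 source project, with all plugin source code, dependencies, and runtime parameters preconfigured, so that the plugin is simpler to use in full.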
http://www.oschina.net/p/nutch-htmlunit
http://git.oschina.net/xautlx/nutch-htmlunit
https://github.com/xautlx/nutch-htmlunit
Nutch Htmlunit Plugin
Project Overview
Built on Apache Nutch 1.8 and the Htmlunit component, this project fetches and parses the complete content of pages that load data through AJAX.
Out of the box, Apache Nutch 1.8 ignores all AJAX requests, so dynamically loaded HTML is missing from the pages it fetches.
This plugin uses Htmlunit to fetch the whole page content, including the necessary dynamic AJAX requests. It was developed and tested with Apache Nutch 1.8; you can try it on other Nutch versions or refactor the source code to suit your own design.
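Main Features
Regular HTML page fetching: ordinary pages without AJAX behavior, such as news pages, can be fetched directly with the protocol-http plugin that ships with Nutch.
Regular AJAX page fetching: the vast majority of pages that load content through mechanisms such as jQuery ajax can be fetched directly with the protocol-htmlunit plugin.
Special AJAX request page fetching: pages such as Taobao/Tmall use the distinctive Kissy Javascript component, so htmlunit cannot directly detect that it needs to wait for the requests issued by Kissy to complete. Such pages are fetched by waiting for the page to load and checking the parsed content to decide when it is ready.
Scroll-triggered AJAX request page fetching: product detail pages on Taobao/Tmall load the product description only when the page is scrolled. The extended handling in protocol-htmlunit makes it possible to fetch this kind of page data.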
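The "wait, then check the parsed content" idea for the special AJAX pages can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the class WaitForContent, the method waitUntilLoaded, and the simulated page are all hypothetical names introduced here. With Htmlunit, the supplier would presumably return the current page markup (for example via HtmlPage.asXml()), and the predicate would test for a marker such as a non-empty price element.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class WaitForContent {

    /**
     * Polls pageSource until isLoaded accepts a snapshot or the timeout expires.
     * Returns the accepted snapshot, or Optional.empty() on timeout or interruption.
     * (Hypothetical helper for illustration only.)
     */
    public static Optional<String> waitUntilLoaded(Supplier<String> pageSource,
                                                   Predicate<String> isLoaded,
                                                   long timeoutMillis,
                                                   long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            String html = pageSource.get();
            if (html != null && isLoaded.test(html)) {
                return Optional.of(html);
            }
            if (System.currentTimeMillis() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Simulated page: the price text only appears on the third snapshot.
        int[] calls = {0};
        Supplier<String> fakePage = () -> ++calls[0] < 3
                ? "<div id='price'></div>"
                : "<div id='price'>99.00</div>";

        Optional<String> result = waitUntilLoaded(
                fakePage, html -> html.contains("99.00"), 2000, 10);

        System.out.println(result.isPresent()
                ? "loaded after " + calls[0] + " polls"
                : "timed out");
    }
}
```

Polling against a content check, rather than sleeping for a fixed time, keeps fast pages fast while still bounding the wait on pages whose requests never finish.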
Trying It Out
Nutch runs in a Unix/Linux environment, so please prepare a Unix/Linux system or a Cygwin environment yourself.
After cloning the whole project with git, change into the local checkout directory and start a crawl:
cd nutch-htmlunit/runtime/local
bin/crawl urls crawl false 1
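// urls is the directory holding the crawler's seed URL files; crawl is the crawler output directory; false stands in for the solr index URL argument, and setting it to false here skips solr indexing; 1 is the number of crawl rounds to run
When the run finishes, you can see that all the information on the Tmall product page, including the price, the description, and the images loaded on scroll, has been retrieved in full.
Sample run log output: http://git.oschina.net/xautlx/nutch-htmlunit/wikis/Log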
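Plugin Descriptions
protocol-htmlunit: an AJAX page Fetcher plugin implemented on top of Htmlunit
parse-s2jh: parses page element content with XPath; writes the parsed structured data out in database form; provides custom callback logic for deciding when individual complex AJAX pages have finished loading
index-s2jh: adds extra attribute data that needs to be passed to the solr index; defines rules for pages that should not be indexed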
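To illustrate the XPath-based extraction that parse-s2jh is described as performing, here is a small sketch using only the JDK's javax.xml.xpath API. It is not the plugin's code: the class XPathFieldExtractor, the field names, and the sample markup are assumptions made for illustration, and the input is assumed to be well-formed XHTML, whereas a real parser plugin would work on the DOM that Nutch builds from the fetched page.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFieldExtractor {

    /**
     * Evaluates each XPath expression against the document and returns
     * a field-name to text-value map. (Hypothetical helper for illustration.)
     */
    public static Map<String, String> extract(String xhtml, Map<String, String> fieldXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : fieldXPaths.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match.
                fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse page or evaluate XPath", ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; real product pages are far messier.
        String page = "<html><body>"
                + "<h1 class=\"title\">Sample item</h1>"
                + "<span id=\"price\">99.00</span>"
                + "</body></html>";

        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("title", "//h1[@class='title']");
        queries.put("price", "//span[@id='price']");

        System.out.println(extract(page, queries));
    }
}
```

Keeping the expressions in a map separates the per-site extraction rules from the evaluation code, so new fields can be added without touching the parsing logic.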
Project details: https://github.com/xautlx/nutch-htmlunit
Download: http://git.oschina.net/xautlx/nutch-htmlunit
Source: OSChina (Open Source China community)

