
Java crawler framework: SeimiCrawler v0.2.7 released

Published: 2016-01-17 09:15:09  Source: Honglian  Author: baihuo
Change log

v0.2.7

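The embedded HTTP interface, which already accepted a single Request in JSON form, now also accepts multiple Requests as a JSON array

The Request object supports a skipDuplicateFilter setting that tells the seimi processor to skip the deduplication mechanism; by default it is not skipped

Added a demo of scheduled (timed) crawling

The type of custom parameter values passed to callback functions through Request changed from Object to String, so they can be handled explicitly

Fix: fixed a logging bug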
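
To illustrate the skipDuplicateFilter item above, here is a minimal, self-contained sketch of a deduplication filter that honours a per-request skip flag. The classes SketchRequest and DuplicateFilterSketch are hypothetical stand-ins written for this example; they are not SeimiCrawler's own classes, and the real filter may work differently (for instance, it may still record skipped URLs).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a crawler request; not SeimiCrawler's Request class.
class SketchRequest {
    final String url;
    final boolean skipDuplicateFilter; // false by default, matching the changelog

    SketchRequest(String url) {
        this(url, false);
    }

    SketchRequest(String url, boolean skipDuplicateFilter) {
        this.url = url;
        this.skipDuplicateFilter = skipDuplicateFilter;
    }
}

// Hypothetical filter that remembers every URL it has let through.
class DuplicateFilterSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true when the request should be processed.
    boolean accept(SketchRequest request) {
        if (request.skipDuplicateFilter) {
            return true; // assumption: skipped requests are not recorded either
        }
        return seen.add(request.url); // Set.add returns false for a repeat
    }

    public static void main(String[] args) {
        DuplicateFilterSketch filter = new DuplicateFilterSketch();
        String url = "http://example.com/a";
        System.out.println(filter.accept(new SketchRequest(url)));       // true
        System.out.println(filter.accept(new SketchRequest(url)));       // false
        System.out.println(filter.accept(new SketchRequest(url, true))); // true
    }
}
```

A flag like this is useful for pages that must be re-fetched on purpose, such as a listing page polled by a scheduled job.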

v0.2.6

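Added a unified startup entry class, intended for use with the future SeimiCrawler Maven build plugin

Optimized meta refresh redirects: at most 3 are followed, so a page that refreshes continuously cannot trap the crawler

Bug fix: custom data set on a Request was not passed through to the Response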
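
The capped meta refresh handling described above can be sketched as follows. This is only an illustration under stated assumptions: MetaRefreshSketch, its simplified regular expression, and the caller-supplied fetch function are all hypothetical, and only the limit of 3 comes from the changelog.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefreshSketch {
    static final int MAX_META_REFRESH = 3; // cap taken from the changelog

    // Simplified pattern for <meta http-equiv="refresh" content="0; url=...">.
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta[^>]*http-equiv=[\"']?refresh[\"']?[^>]*content=[\"'][^\"']*?url=([^\"'>\\s]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the refresh target found in the page, or null if there is none.
    static String refreshTarget(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Follows meta refresh hops from startUrl and gives up after MAX_META_REFRESH hops.
    // 'fetch' is a hypothetical page loader mapping a URL to its HTML.
    static String resolveRealUrl(String startUrl, Function<String, String> fetch) {
        String url = startUrl;
        for (int hops = 0; hops < MAX_META_REFRESH; hops++) {
            String html = fetch.apply(url);
            String next = (html == null) ? null : refreshTarget(html);
            if (next == null) {
                return url; // no further refresh: this is the final page
            }
            url = next;
        }
        return url; // stop here even if the page keeps refreshing
    }

    public static void main(String[] args) {
        // Simulated site where /1 refreshes to /2, /2 to /3, and so on.
        Map<String, String> pages = new HashMap<>();
        for (int i = 1; i <= 5; i++) {
            pages.put("/" + i,
                "<meta http-equiv=\"refresh\" content=\"0; url=/" + (i + 1) + "\">");
        }
        System.out.println(resolveRealUrl("/1", pages::get)); // /4
    }
}
```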

v0.2.5

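Added a mechanism that puts a request back on the queue when it hits a serious exception

If a request still fails with an unexpected exception after the retry mechanism for network errors has run, it is put back on the queue to wait for processing again, as long as it has not exceeded the maximum number of reprocessing attempts (set by the developer, or the default). Once that limit is reached, seimi calls BaseSeimiCrawler.handleErrorRequest(Request request), which the developer overrides, to handle and record the failed request. Combined with the delay feature, this re-queue mechanism goes a long way toward avoiding request failures caused by a target site's anti-crawler policy, and toward avoiding the loss of any record of those requests.

Optimized duplicate detection

Improved how the encoding of non-standard pages is determined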
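
The re-queue mechanism described above can be pictured with the following self-contained sketch. RequeueSketch, its Req class and its in-memory queue are hypothetical and exist only for this example; its handleErrorRequest method merely plays the role the changelog assigns to BaseSeimiCrawler.handleErrorRequest(Request request) and is not the framework's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

class RequeueSketch {
    // Hypothetical request that carries its own re-queue counter.
    static class Req {
        final String url;
        int requeueCount = 0;

        Req(String url) {
            this.url = url;
        }
    }

    private final Deque<Req> queue = new ArrayDeque<>();
    private final int maxRequeue; // developer-set or default limit
    final List<String> failed = new ArrayList<>();

    RequeueSketch(int maxRequeue) {
        this.maxRequeue = maxRequeue;
    }

    void push(String url) {
        queue.addLast(new Req(url));
    }

    // Last resort: record the request that could not be processed.
    void handleErrorRequest(Req req) {
        failed.add(req.url);
    }

    // Drains the queue; 'processor' throws to signal an unexpected error.
    void run(Consumer<Req> processor) {
        Req req;
        while ((req = queue.pollFirst()) != null) {
            try {
                processor.accept(req);
            } catch (RuntimeException e) {
                if (req.requeueCount < maxRequeue) {
                    req.requeueCount++;
                    queue.addLast(req); // back of the queue, wait for another turn
                } else {
                    handleErrorRequest(req);
                }
            }
        }
    }

    public static void main(String[] args) {
        RequeueSketch crawler = new RequeueSketch(2);
        crawler.push("http://example.com/ok");
        crawler.push("http://example.com/bad");
        crawler.run(r -> {
            if (r.url.endsWith("/bad")) {
                throw new IllegalStateException("simulated failure");
            }
        });
        System.out.println(crawler.failed); // [http://example.com/bad]
    }
}
```

With a limit of 2, a request that always fails is attempted three times in total (the first try plus two re-queues) before it is handed to the error handler.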

v0.2.4

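Enhanced automatic redirects: besides 301 and 302, page redirects made through meta refresh are now recognized

The Response object gains getRealUrl(), which returns the real URL of the content after redirects and jumps

The 'useUnrepeated' attribute of the @Crawler annotation controls whether the system-level deduplication mechanism is enabled; it is on by default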

v0.2.3

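Support for custom dynamic proxies
Developers can override BaseSeimiCrawler.proxy() to decide which proxy each request uses. If the method is overridden and returns a valid proxy address, the proxy attribute in @Crawler is ignored.

Added demos of dynamic proxy and dynamic User-Agent usage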
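
The override-or-fall-back rule for proxies described above might look like the sketch below. BaseCrawlerSketch and RotatingProxyCrawler are hypothetical classes written for this illustration, the constructor argument stands in for the proxy attribute of @Crawler, and the proxy addresses are placeholders. The real BaseSeimiCrawler.proxy() may differ in signature and in what it treats as a valid address.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical base class; it only mirrors the override-or-fall-back rule.
abstract class BaseCrawlerSketch {
    private final String staticProxy; // stands in for the proxy attribute of @Crawler

    BaseCrawlerSketch(String staticProxy) {
        this.staticProxy = staticProxy;
    }

    // Subclasses may override this to pick a proxy per request; null means "no opinion".
    String proxy() {
        return null;
    }

    // The proxy actually used for the next request.
    final String effectiveProxy() {
        String dynamic = proxy();
        boolean usable = dynamic != null && !dynamic.trim().isEmpty();
        return usable ? dynamic : staticProxy;
    }
}

// Example subclass that rotates through a small pool of placeholder addresses.
class RotatingProxyCrawler extends BaseCrawlerSketch {
    private final List<String> pool = Arrays.asList(
        "http://10.0.0.1:8080", "http://10.0.0.2:8080");
    private final AtomicInteger next = new AtomicInteger();

    RotatingProxyCrawler() {
        super("http://static-proxy.example:3128");
    }

    @Override
    String proxy() {
        // Round-robin over the pool; floorMod keeps the index valid after overflow.
        return pool.get(Math.floorMod(next.getAndIncrement(), pool.size()));
    }

    public static void main(String[] args) {
        RotatingProxyCrawler crawler = new RotatingProxyCrawler();
        for (int i = 0; i < 3; i++) {
            System.out.println(crawler.effectiveProxy());
        }
    }
}
```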

v0.2.2

Improved encoding detection and compatibility for non-standard web pages

v0.2.1

Optimized the regex filtering mechanism for blacklists and whitelists

v0.2.0

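Added support for submitting Requests in JSON format through the embedded HTTP service API

Added custom allowRules and denyRules settings for validating request URLs, i.e. whitelist and blacklist rules, both written as regular expressions. The default is null, meaning no check is performed

Added unified validity checking for Requests

Added support for setting a delay between requests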
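
As a rough picture of how allowRules and denyRules could be applied to a URL, consider the sketch below. UrlRuleSketch is a hypothetical class; the choice to check the blacklist first and to use partial matching (find) rather than whole-string matching are assumptions of this example, not documented SeimiCrawler behaviour. Only the rule names, the regular expression format and the null default come from the changelog.

```java
import java.util.regex.Pattern;

class UrlRuleSketch {
    private final Pattern[] allowRules; // whitelist; null means "do not check"
    private final Pattern[] denyRules;  // blacklist; null means "do not check"

    UrlRuleSketch(String[] allow, String[] deny) {
        this.allowRules = compile(allow);
        this.denyRules = compile(deny);
    }

    private static Pattern[] compile(String[] rules) {
        if (rules == null) {
            return null;
        }
        Pattern[] compiled = new Pattern[rules.length];
        for (int i = 0; i < rules.length; i++) {
            compiled[i] = Pattern.compile(rules[i]);
        }
        return compiled;
    }

    private static boolean matchesAny(Pattern[] rules, String url) {
        for (Pattern rule : rules) {
            if (rule.matcher(url).find()) { // assumption: partial match is enough
                return true;
            }
        }
        return false;
    }

    // A URL passes if no deny rule matches and, when a whitelist is set, some allow rule matches.
    boolean isAllowed(String url) {
        if (denyRules != null && matchesAny(denyRules, url)) {
            return false;
        }
        if (allowRules != null && !matchesAny(allowRules, url)) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        UrlRuleSketch rules = new UrlRuleSketch(
            new String[] {"^https?://example\\.com/"},
            new String[] {"\\.(jpg|png)$"});
        System.out.println(rules.isAllowed("http://example.com/news/1.html")); // true
        System.out.println(rules.isAllowed("http://example.com/logo.png"));    // false
        System.out.println(rules.isAllowed("http://other.org/"));              // false
    }
}
```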

Introduction

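SeimiCrawler is an agile, independently deployable Java crawler framework with support for distributed operation. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly available and performs reasonably well, and to improve the efficiency of developing crawler systems. In the world of SeimiCrawler, most people only need to care about writing the business logic of the crawl; Seimi takes care of the rest. In its design, SeimiCrawler is heavily inspired by Scrapy, the Python crawler framework, while incorporating the characteristics of the Java language and the features of Spring. It also hopes to make the more efficient XPath approach to parsing HTML more convenient and more widely used in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not part of jsoup), and by default all parsing and extraction of HTML data is done with XPath (you can of course choose another parser for data processing).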
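
For readers unfamiliar with XPath-based extraction, the sketch below shows the general idea. It deliberately uses the XPath engine bundled with the JDK (javax.xml.xpath) on a small well-formed snippet, not JsoupXpath, whose API is not shown here; XPathSketch and its selectText helper are hypothetical names for this example. Unlike an HTML-aware parser, the JDK parser only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Evaluates an XPath expression against well-formed markup and
    // returns the text content of every matching node.
    static List<String> selectText(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1 class=\"title\">SeimiCrawler</h1>"
            + "<a href=\"/a\">first</a><a href=\"/b\">second</a></body></html>";
        System.out.println(selectText(page, "//h1[@class='title']/text()")); // [SeimiCrawler]
        System.out.println(selectText(page, "//a/@href"));                   // [/a, /b]
    }
}
```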

Community discussion

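If you have questions or suggestions, you can discuss them on the mailing list below. Before posting for the first time you need to subscribe and wait for approval (mainly to screen out advertising and spam)

Subscribe: send an email to seimicrawler+subscribe@googlegroups.com

Post: send an email to seimicrawler@googlegroups.com

Unsubscribe: send an email to seimicrawler+unsubscribe@googlegroups.com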

Project source code

If you think this project is any good, I wouldn't mind a star on GitHub

Software details: https://github.com/zhegexiaohuozi/SeimiCrawler

Source: OSChina (Open Source China) community