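Jieba Chinese word segmentation 0.34 has been released. Changes in this version:
2014-10-20: version 0.34
1. Improved performance: the dictionary structure was changed from a Trie to a Prefix Set, cutting memory usage by 2/3. Details: https://github.com/fxsjy/jieba/pull/187; by @gumblex
2. Fixed a performance problem in the keyword extraction feature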
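As a rough picture of what a prefix set can look like, the sketch below stores every word and every proper prefix of every word in one flat dict, with prefixes that are not words mapped to frequency 0. This is a minimal illustration written for Python 3, not jieba's actual code: the function name build_prefix_dict and the toy frequencies are assumptions made for the example.

```python
def build_prefix_dict(word_freqs):
    """Map each word to its frequency and each proper prefix to 0."""
    prefix_dict = {}
    for word, freq in word_freqs.items():
        prefix_dict[word] = freq
        for end in range(1, len(word)):
            # Keep an existing frequency if the prefix is itself a word.
            prefix_dict.setdefault(word[:end], 0)
    return prefix_dict


# Hypothetical toy dictionary with invented frequencies.
toy_freqs = {"清华": 5, "清华大学": 3, "大学": 8}
prefix_dict = build_prefix_dict(toy_freqs)
```

A scan can stop extending a candidate as soon as it is missing from the dict, which is the same early exit a trie gives, but with a single hash lookup per step.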
jieba
"Jieba" (结巴) Chinese word segmentation: built to be the best Python Chinese word segmentation component
Feature
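Three segmentation modes are supported:
Accurate mode tries to cut the sentence into the most precise segments and is suited to text analysis;
Full mode scans out every word that can be formed from the sentence. It is very fast but cannot resolve ambiguity;
Search engine mode builds on accurate mode by cutting long words a second time to improve recall, and is suited to search engine segmentation.
Traditional Chinese segmentation is supported
Custom dictionaries are supported
Online demo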
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Installation on Python 2.x
Fully automatic: easy_install jieba or pip install jieba
Semi-automatic: download from http://pypi.python.org/pypi/jieba/ , unpack, then run python setup.py install
Manual: place the jieba directory in the current directory or in the site-packages directory
Use it with import jieba (the Trie has to be built on the first import, which takes a few seconds)
Installation on Python 3.x
The master branch currently supports Python 2.x only
A branch for Python 3.x is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
cd jieba
git checkout jieba3k
python setup.py install
Algorithm
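Efficient word-graph scanning based on a Trie structure builds a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form
Dynamic programming finds the maximum-probability path, giving the best segmentation based on word frequency
For out-of-vocabulary words, an HMM based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm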
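The first two steps above can be sketched in a few lines of Python 3. This is a simplified illustration, not jieba's implementation: the dictionary FREQ and its frequencies are invented, the function names are hypothetical, unknown single characters are given an assumed frequency of 1, and the HMM step for out-of-vocabulary words is left out entirely.

```python
import math

# Hypothetical toy dictionary: word -> frequency (invented numbers).
FREQ = {"我": 50, "来到": 20, "北京": 30, "清华": 10, "清华大学": 8,
        "华大": 2, "大学": 25}
TOTAL = sum(FREQ.values())


def build_dag(sentence, freq):
    """For each start index, list every end index that closes a dictionary word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in freq]
        # A character that starts no known word still forms a one-character edge.
        dag[start] = ends or [start]
    return dag


def best_path(sentence, dag, freq, total):
    """Right-to-left dynamic programming: best (log probability, end) per start."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1) / total)
             + route[end + 1][0], end)
            for end in dag[start])
    return route


def segment(sentence, freq=FREQ, total=TOTAL):
    """Follow the best path from left to right and collect the words."""
    dag = build_dag(sentence, freq)
    route = best_path(sentence, dag, freq, total)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print("/ ".join(segment("我来到北京清华大学")))
```

With these toy frequencies the path through 清华大学 scores higher than 清华 followed by 大学, so the long word is kept whole.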
Function 1): Segmentation
The jieba.cut method accepts two input parameters: 1) the first is the string to segment; 2) cut_all controls whether full mode is used
The jieba.cut_for_search method accepts one parameter, the string to segment. It is suited to segmentation for building a search engine's inverted index and produces finer-grained output
Note: the string to segment may be a GBK string, a UTF-8 string, or unicode
Both jieba.cut and jieba.cut_for_search return an iterable generator. Use a for loop to get each segmented word (unicode), or convert the result to a list with list(jieba.cut(...))
Code example (segmentation)
#encoding=utf-8
import jieba
seg_list = jieba.cut("我来到北京清华大学",cut_all=True)
print "Full Mode:", "/ ".join(seg_list) # full mode
seg_list = jieba.cut("我来到北京清华大学",cut_all=False)
print "Default Mode:", "/ ".join(seg_list) # accurate mode
seg_list = jieba.cut("他来到了网易杭研大厦") # accurate mode is the default
print ", ".join(seg_list)
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造") # search engine mode
print ", ".join(seg_list)
Output:
[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate mode]: 我/ 来到/ 北京/ 清华大学
[New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but the Viterbi algorithm still recognizes it)
[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
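The search engine output above is consistent with a simple second pass over the accurate-mode result: for each long word, emit the two-character and three-character substrings that are themselves dictionary words, then the word itself. The Python 3 sketch below illustrates that idea under stated assumptions: the function name, the hard-coded input words, and the small dictionary are all made up for the example, and it does not call jieba.

```python
def cut_for_search_sketch(words, dictionary):
    """Yield shorter dictionary words found inside each long word, then the word."""
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in dictionary:
                        yield gram
        yield word


# Assumed accurate-mode result and a hypothetical dictionary of shorter words.
accurate_words = ["中国科学院", "计算所"]
short_words = {"中国", "科学", "学院", "科学院", "计算"}
print(", ".join(cut_for_search_sketch(accurate_words, short_words)))
```

Emitting the sub-words before the full word is what raises recall in an inverted index: a query for 科学院 can match a document that only contains 中国科学院.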
Function 2): Adding a custom dictionary
Developers can specify their own custom dictionary to include words that are missing from the jieba dictionary. Although jieba can recognize new words, adding them yourself ensures higher accuracy
Usage: jieba.load_userdict(file_name) # file_name is the path to the custom dictionary
The dictionary format is the same as dict.txt: one word per line; each line has three parts separated by spaces: the word, the word frequency, and finally the part-of-speech tag (optional)
Example:
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
"Using a user-defined dictionary to improve ambiguity correction" --- https://github.com/fxsjy/jieba/issues/14
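To make the dictionary line format described above concrete (word, frequency, optional part-of-speech tag, separated by spaces), here is a small Python 3 parser. It is only an illustration of the format: parse_userdict is a hypothetical helper, not part of jieba, and the sample entries and their frequencies are invented for the example.

```python
def parse_userdict(text):
    """Parse 'word freq [tag]' lines into (word, freq, tag) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, freq = parts[0], int(parts[1])
        tag = parts[2] if len(parts) > 2 else None  # the tag may be omitted
        entries.append((word, freq, tag))
    return entries


# Illustrative sample content (frequencies and tags are assumptions).
SAMPLE = "云计算 5\n李小福 2 nr\n创新办 3 i\n"
entries = parse_userdict(SAMPLE)
```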
Function 3): Keyword extraction
jieba.analyse.extract_tags(sentence,topK) # import jieba.analyse first
sentence is the text to extract keywords from
topK is the number of keywords with the highest TF/IDF weights to return; the default is 20
Code example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
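For readers unfamiliar with TF/IDF ranking, the Python 3 sketch below shows the general idea behind picking the topK keywords: weight each word by its frequency in the text times an inverse document frequency, then keep the highest-weighted words. It is a generic illustration, not jieba's code: the input is assumed to be already segmented, and the function name, the IDF table, and the default_idf fallback are all assumptions.

```python
from collections import Counter


def extract_tags_sketch(words, idf, top_k=20, default_idf=1.0):
    """Return the top_k words ranked by term frequency times IDF."""
    counts = Counter(words)
    total = sum(counts.values())
    weights = {word: (count / total) * idf.get(word, default_idf)
               for word, count in counts.items()}
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    return ranked[:top_k]


# Hypothetical segmented text and invented IDF values.
words = ["大学", "的", "清华", "大学", "的", "的"]
idf = {"大学": 3.0, "清华": 5.0, "的": 0.1}
top_two = extract_tags_sketch(words, idf, top_k=2)
```

The very frequent function word 的 ends up last because its low IDF outweighs its high count.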
Function 4): Part-of-speech tagging
Tags the part of speech of each word after the sentence is segmented, using a notation compatible with ictclas
Usage example
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns
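Function 5): Parallel segmentation
How it works: the target text is split by line, the lines are distributed to multiple Python processes for parallel segmentation, and the results are then merged, which gives a considerable speedup
Based on Python's built-in multiprocessing module; Windows is not currently supported
Usage:
jieba.enable_parallel(4) # enable parallel segmentation; the argument is the number of parallel processes
jieba.disable_parallel() # disable parallel segmentation
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Benchmark: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1MB/s, 3.3 times the speed of the single-process version.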
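The split, distribute, and merge principle described above can be sketched with the standard multiprocessing module. This Python 3 example is an assumption-laden illustration rather than jieba's implementation: segment_line is a stand-in that only splits on whitespace, and the function names are hypothetical.

```python
from multiprocessing import Pool


def segment_line(line):
    """Stand-in for a real segmenter: splits on whitespace only."""
    return line.split()


def split_map_merge(text, map_fn=map):
    """Split text by line, segment each line via map_fn, merge results in order."""
    lines = text.splitlines()
    per_line = map_fn(segment_line, lines)
    return [word for words in per_line for word in words]


def parallel_segment(text, processes=4):
    """Same pipeline, with the lines distributed over a pool of worker processes."""
    with Pool(processes) as pool:
        return split_map_merge(text, map_fn=pool.map)
```

Call parallel_segment from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Because Pool.map returns results in input order, the merged output matches what the single-process split_map_merge produces.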
Function 6): Tokenize: returns the start and end positions of each word in the original text
Note: the input parameter accepts unicode only
Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print "word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
Search mode
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
    print "word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
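The offsets in both listings follow from a running character position: each word starts where the previous one ended, and in search mode the shorter words inside a long word are reported at their own positions before the long word. The Python 3 sketch below reproduces that bookkeeping under stated assumptions: the segmented words are supplied by hand, the function name and the small dictionary are hypothetical, and jieba is not called.

```python
def tokenize_sketch(words, dictionary=()):
    """Yield (word, start, end) tuples; end is exclusive.

    When a dictionary is given, two- and three-character dictionary words
    inside each longer word are yielded first, as in search mode.
    """
    start = 0
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in dictionary:
                        yield (gram, start + i, start + i + size)
        yield (word, start, start + len(word))
        start += len(word)


# Assumed segmentation of the sentence used in the listings above.
segmented = ["永和", "服装", "饰品", "有限公司"]
default_tokens = list(tokenize_sketch(segmented))
search_tokens = list(tokenize_sketch(segmented, dictionary={"有限", "公司"}))
```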
Function 7): ChineseAnalyzer for the Whoosh search engine
Import: from jieba.analyse import ChineseAnalyzer
Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
Other dictionaries
1. A smaller dictionary file that uses less memory https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. A dictionary file with better support for Traditional Chinese segmentation https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt with it, or use jieba.set_dictionary('data/dict.txt.big')
Change to the module initialization mechanism: lazy load (since version 0.28)
jieba uses lazy loading: "import jieba" does not trigger dictionary loading immediately; the dictionary is loaded and the trie built only when it is first needed. You can also initialize jieba manually if you prefer.
import jieba
jieba.initialize() # manual initialization (optional)
Versions before 0.28 could not specify the path of the main dictionary. With the lazy loading mechanism, you can change the main dictionary path:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
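The lazy loading pattern described in this section can be illustrated with a small Python 3 class: nothing is loaded at construction time, the first lookup triggers the load, and swapping the source before first use costs nothing. This is a generic sketch, not jieba's code: the class, its method names, and the in-memory loaders standing in for dictionary files are all assumptions.

```python
class LazyDictionary:
    """Loads its word set on first use instead of at construction time."""

    def __init__(self, loader):
        self._loader = loader   # callable returning an iterable of words
        self._words = None      # None means "not loaded yet"
        self.load_count = 0

    def set_loader(self, loader):
        """Switch the word source; any loaded data is discarded."""
        self._loader = loader
        self._words = None

    def initialize(self):
        """Load now if not loaded yet (optional; lookups call this anyway)."""
        if self._words is None:
            self._words = set(self._loader())
            self.load_count += 1

    def __contains__(self, word):
        self.initialize()
        return word in self._words


# Hypothetical in-memory loaders standing in for dictionary files.
lazy = LazyDictionary(lambda: ["北京", "大学"])
loads_before_use = lazy.load_count
lazy.set_loader(lambda: ["北京", "大学", "清华大学"])
found = "清华大学" in lazy
loads_after_use = lazy.load_count
```

Because the original loader is replaced before any lookup happens, it is never run, which is the reason a dictionary path can only be changed cheaply when loading is deferred.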
Segmentation speed
1.5 MB / Second in Full Mode
400 KB / Second in Default Mode
Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt
FAQ
1) How was the model data generated? https://github.com/fxsjy/jieba/issues/7
2) What is the license of this library? https://github.com/fxsjy/jieba/issues/2
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Change Log
http://www.oschina.net/p/jieba/news#list
Software details: https://github.com/fxsjy/jieba
Download: https://github.com/fxsjy/jieba/releases
Source: 开源中国社区 (OSChina, the Open Source China community)

