
Jieba 0.34 Released: a Python Chinese Word Segmentation Component

Published: 2014-10-20 21:42:45  Source: 红联  Author: empast
Jieba 0.34 has been released. Changes in this version:

2014-10-20: version 0.34
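1. Improved performance: the dictionary structure was changed from a Trie to a Prefix Set, cutting memory usage by 2/3. Details: https://github.com/fxsjy/jieba/pull/187 (by @gumblex)
2. Fixed a performance problem in the keyword extraction feature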
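
The Trie-to-Prefix-Set change in item 1 can be pictured with a small sketch (Python 3, standard library only). This is not jieba's code: the names build_prefix_dict and words_starting_at and the toy word counts are assumptions made for illustration. The idea shown is that every prefix of every word lives in one flat dict, so a scan can stop as soon as a fragment is no longer a known prefix, without nested per-character nodes.

```python
# Toy word frequencies; the words and counts are invented for illustration.
TOY_FREQ = {"北京": 5, "北京大学": 3, "大学": 4}


def build_prefix_dict(word_freq):
    """Store every word and every proper prefix of it in one flat dict.

    Real words keep their frequency; prefixes that are not words get 0.
    """
    prefix_dict = {}
    for word, freq in word_freq.items():
        prefix_dict[word] = freq
        for end in range(1, len(word)):
            # setdefault never overwrites a real word's frequency with 0.
            prefix_dict.setdefault(word[:end], 0)
    return prefix_dict


def words_starting_at(sentence, start, prefix_dict):
    """Return the exclusive end indices of dictionary words starting at `start`."""
    ends = []
    end = start + 1
    # Keep extending while the fragment is still a known prefix.
    while end <= len(sentence) and sentence[start:end] in prefix_dict:
        if prefix_dict[sentence[start:end]] > 0:
            ends.append(end)
        end += 1
    return ends


prefix_dict = build_prefix_dict(TOY_FREQ)
print(words_starting_at("北京大学", 0, prefix_dict))  # [2, 4]
```

A single hash lookup per fragment replaces one dict hop per character, which is one plausible reason a flat structure uses less memory than nested nodes.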

jieba

"结巴" (Jieba) Chinese word segmentation: built to be the best Python Chinese word segmentation component

Feature

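Three segmentation modes are supported:

Precise mode: tries to cut the sentence as accurately as possible; suited to text analysis;

Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity;

Search engine mode: builds on precise mode by cutting long words again to improve recall; suited to search engine segmentation.

Traditional Chinese segmentation is supported

Custom dictionaries are supported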

Online demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

Installation on Python 2.x

Fully automatic: easy_install jieba or pip install jieba

Semi-automatic: download the package from http://pypi.python.org/pypi/jieba/ , unpack it, and run python setup.py install

Manual: place the jieba directory in the current directory or in the site-packages directory

Use it with import jieba (the first import has to build the Trie, which takes a few seconds)

Installation on Python 3.x

The master branch currently supports Python 2.x only

A branch for Python 3.x is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
cd jieba
git checkout jieba3k
python setup.py install

Algorithm

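Efficient word-graph scanning based on a Trie structure, producing a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form

Dynamic programming to find the maximum-probability path, which gives the best segmentation based on word frequency

For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm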
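
The first two steps above can be sketched as follows (Python 3, standard library only). This is a minimal illustration and not jieba's code: the FREQ table and its counts are invented, and the function names build_dag, best_path and toy_cut are assumptions. The DAG maps each start index to the end indices of dictionary words beginning there, and the dynamic program walks from right to left, keeping the highest log-probability split for each suffix.

```python
import math

# Toy dictionary: word -> frequency. The counts are invented for illustration.
FREQ = {"我": 50, "来到": 30, "北京": 40, "清华": 10,
        "清华大学": 20, "华大": 2, "大学": 35}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the inclusive end indices of words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        # A character that starts no dictionary word stands alone.
        dag[i] = ends or [i]
    return dag


def best_path(sentence, dag):
    """Right-to-left dynamic programming over the DAG.

    route[i] = (best log-probability of sentence[i:], end index of its first word)
    """
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(TOTAL)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - log_total
             + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def toy_cut(sentence):
    """Follow the best route from left to right and collect the words."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print("/ ".join(toy_cut("我来到北京清华大学")))  # 我/ 来到/ 北京/ 清华大学
```

With these toy counts, "清华大学" as one word scores higher than "清华" followed by "大学", so the longer word wins.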

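Function 1): Segmentation

The jieba.cut method accepts two input parameters: 1) the first parameter is the string to segment; 2) the cut_all parameter controls whether full mode is used

The jieba.cut_for_search method accepts one parameter: the string to segment. It is suited to segmentation for building a search engine's inverted index, and its granularity is relatively fine

Note: the string to segment can be a GBK string, a UTF-8 string, or unicode

Both jieba.cut and jieba.cut_for_search return an iterable generator. Use a for loop to get each word (unicode) produced by segmentation, or convert the result to a list with list(jieba.cut(...))

Code example (segmentation)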
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full Mode:", "/ ".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/ ".join(seg_list)  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ", ".join(seg_list)
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but the Viterbi algorithm still recognizes it)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
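
The new-word case in the output ("杭研") comes from the HMM and the Viterbi algorithm. A toy sketch (Python 3, standard library only) shows how that decoding works. Every probability below is invented, the B/M/E/S tag set (Begin, Middle, End, Single) is the usual convention for this kind of tagger and is assumed here, and none of this is jieba's trained model or code.

```python
import math

STATES = "BMES"   # Begin / Middle / End of a word, or Single-character word
MIN_LOG = -1e9    # stand-in for log(0)


def log(p):
    return math.log(p) if p > 0 else MIN_LOG


# All probabilities are invented toy values.
START = {"B": log(0.6), "M": log(0), "E": log(0), "S": log(0.4)}
TRANS = {
    "B": {"M": log(0.2), "E": log(0.8)},
    "M": {"M": log(0.3), "E": log(0.7)},
    "E": {"B": log(0.6), "S": log(0.4)},
    "S": {"B": log(0.6), "S": log(0.4)},
}
EMIT = {
    "B": {"杭": 0.5, "研": 0.1},
    "M": {"杭": 0.1, "研": 0.1},
    "E": {"杭": 0.1, "研": 0.6},
    "S": {"杭": 0.1, "研": 0.1},
}


def emit(state, char):
    # Unseen characters get a small assumed probability.
    return log(EMIT[state].get(char, 1e-6))


def viterbi(chars):
    """Return the most probable tag sequence for a non-empty string."""
    # v[t][s]: best log-probability of a tagging of chars[:t+1] ending in s
    v = [{s: START[s] + emit(s, chars[0]) for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        v.append({})
        back.append({})
        for s in STATES:
            score, prev = max(
                (v[t - 1][p] + TRANS[p].get(s, MIN_LOG), p) for p in STATES
            )
            v[t][s] = score + emit(s, chars[t])
            back[t][s] = prev
    # A complete segmentation has to end on E or S.
    last = max("ES", key=lambda s: v[-1][s])
    tags = [last]
    for t in range(len(chars) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return tags[::-1]


def tags_to_words(chars, tags):
    """Close a word at every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(chars[start:i + 1])
            start = i + 1
    return words


print(tags_to_words("杭研", viterbi("杭研")))  # ['杭研']
```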
Function 2): Adding a custom dictionary

Developers can specify their own custom dictionary to cover words that are missing from the jieba dictionary. Although jieba can recognize new words, adding them yourself ensures higher accuracy

Usage: jieba.load_userdict(file_name) # file_name is the path of the custom dictionary

The dictionary format is the same as dict.txt: one word per line, and each line has three parts separated by spaces: the word, the word frequency, and finally the part-of-speech tag (optional)

Example:

Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
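
The line format described above (word, frequency, optional part-of-speech tag, separated by spaces) can be read with a few lines of Python 3. This is a sketch, not the parser inside jieba.load_userdict: the function name parse_userdict_line is an assumption, and the sample frequencies and tags are illustrative values only.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); tag may be None."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected 'word freq [tag]', got: %r" % line)
    word, freq = parts[0], int(parts[1])
    tag = parts[2] if len(parts) > 2 else None
    return word, freq, tag


# Sample lines; the frequencies and tags are assumed for illustration.
SAMPLE = """云计算 5
李小福 2 nr
创新办 3 i"""

for row in SAMPLE.splitlines():
    print(parse_userdict_line(row))
```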

Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt

Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py

"Using a custom dictionary to strengthen ambiguity correction" --- https://github.com/fxsjy/jieba/issues/14

Function 3): Keyword extraction

jieba.analyse.extract_tags(sentence,topK) # requires import jieba.analyse first

sentence is the text to extract keywords from

topK is the number of keywords with the highest TF/IDF weights to return; the default is 20

Code example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
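
To make the TF/IDF weighting concrete, here is a minimal sketch (Python 3, standard library only) that ranks already-segmented words. It is not jieba.analyse.extract_tags: the function name toy_extract_tags, the TOY_IDF table and the DEFAULT_IDF fallback are all assumptions, and a real IDF table would be computed from a large corpus.

```python
from collections import Counter

# Hypothetical IDF values; real ones come from a large corpus.
TOY_IDF = {"分词": 6.0, "中文": 4.0, "组件": 5.0, "的": 0.5}
DEFAULT_IDF = 3.0  # assumed fallback for words missing from the table


def toy_extract_tags(words, top_k=20):
    """Rank segmented words by term frequency times IDF; return the top_k."""
    counts = Counter(words)
    total = float(sum(counts.values()))
    scores = {w: (c / total) * TOY_IDF.get(w, DEFAULT_IDF)
              for w, c in counts.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


words = ["中文", "分词", "的", "组件", "分词", "的", "的"]
print(toy_extract_tags(words, top_k=2))  # ['分词', '组件']
```

"的" is the most frequent word in the sample, but its low IDF pushes it to the bottom of the ranking, which is the point of the weighting.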
Function 4): Part-of-speech tagging

Tags the part of speech of each word after segmentation, using a notation compatible with ictclas

Usage example
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns

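Function 5): Parallel segmentation

Principle: split the target text by line, distribute the lines to multiple python processes to be segmented in parallel, then merge the results, which gives a considerable speedup

Based on python's built-in multiprocessing module; Windows is not supported for now

Usage:

jieba.enable_parallel(4) # enable parallel segmentation mode; the argument is the number of parallel processes

jieba.disable_parallel() # disable parallel segmentation mode

Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

Experimental result: on a 4-core 3.4GHz Linux machine, precise-mode segmentation of the complete works of Jin Yong (金庸) reached 1MB/s, 3.3 times the speed of the single-process version.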
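
The principle described above (split by line, segment the lines in worker processes, merge in order) can be sketched with the multiprocessing module (Python 3). This is an illustration under assumptions, not jieba's implementation: segment_line is a stand-in greedy matcher over a hypothetical TOY_WORDS set, and the "fork" start method is simply the choice made for this sketch; it exists only on POSIX systems.

```python
import multiprocessing

TOY_WORDS = {"我", "来到", "北京"}  # hypothetical dictionary
MAX_LEN = 2                         # longest word in TOY_WORDS


def segment_line(line):
    """Stand-in segmenter: greedy forward longest match against TOY_WORDS."""
    words, i = [], 0
    while i < len(line):
        for size in range(min(MAX_LEN, len(line) - i), 0, -1):
            piece = line[i:i + size]
            # A single character is always accepted, so the scan advances.
            if size == 1 or piece in TOY_WORDS:
                words.append(piece)
                i += size
                break
    return words


def parallel_segment(text, processes=1):
    """Split text by line, segment each line, and merge results in order."""
    lines = text.splitlines()
    if processes <= 1:
        per_line = [segment_line(line) for line in lines]
    else:
        # The "fork" start method is POSIX-only (assumption of this sketch).
        ctx = multiprocessing.get_context("fork")
        with ctx.Pool(processes) as pool:
            # Pool.map returns results in the same order as the input lines.
            per_line = pool.map(segment_line, lines)
    return [word for words in per_line for word in words]


print(parallel_segment("我来到北京\n北京", processes=1))
```

Passing processes=4 spreads the lines over four workers; when doing so, call the function from a script guarded by if __name__ == "__main__":.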

Function 6): Tokenize: return the start and end positions of each word in the original text

Note: the input parameter accepts unicode only

Default mode

result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print "word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])

word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10

Search mode

result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
    print "word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])

word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
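
The offsets in both listings can be reproduced from an already-segmented word list with a short sketch (Python 3, standard library only). The function name tokenize_words and the TOY_DICT set are assumptions made for illustration; the search branch emits shorter dictionary words found inside a long word before the long word itself, which is one way to get the extra entries shown for search mode.

```python
# Hypothetical dictionary used to find shorter words inside long ones.
TOY_DICT = {"永和", "服装", "饰品", "有限", "公司", "有限公司"}


def tokenize_words(words, mode="default"):
    """Yield (word, start, end) for a list of already-segmented words."""
    start = 0
    for w in words:
        if mode == "search":
            # Look for 2-character, then 3-character dictionary words in w.
            for size in (2, 3):
                if len(w) > size:
                    for i in range(len(w) - size + 1):
                        gram = w[i:i + size]
                        if gram in TOY_DICT:
                            yield (gram, start + i, start + i + size)
        yield (w, start, start + len(w))
        start += len(w)


segmented = ["永和", "服装", "饰品", "有限公司"]
for word, begin, end in tokenize_words(segmented, mode="search"):
    print("word %s\t\t start: %d \t\t end:%d" % (word, begin, end))
```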
Function 7): ChineseAnalyzer for the Whoosh search engine

Import: from jieba.analyse import ChineseAnalyzer

Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py

Other dictionaries

1. A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

2. A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

Download the dictionary you need, then either overwrite jieba/dict.txt with it or call jieba.set_dictionary('data/dict.txt.big')

Change in the module initialization mechanism: lazy load (since version 0.28)

jieba uses lazy loading: "import jieba" does not immediately trigger loading of the dictionary; the dictionary is loaded and the trie built only when it is needed. If you prefer, you can also initialize jieba manually.
import jieba
jieba.initialize() # manual initialization (optional)
Versions before 0.28 could not specify the path of the main dictionary. With the lazy loading mechanism, you can change the main dictionary path:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
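
The lazy-load behaviour described above follows a common pattern, sketched here in Python 3 with the standard library only. The class LazyDictionary and its method names are hypothetical and are not part of jieba: the loader runs on the first lookup or on an explicit initialize() call, and swapping the loader before then costs nothing.

```python
class LazyDictionary(object):
    """Defers loading the word table until it is first needed."""

    def __init__(self, loader):
        self._loader = loader   # callable returning {word: freq}
        self._table = None
        self.load_count = 0     # how many times the loader has run

    def set_loader(self, loader):
        """Swap the data source; the next lookup triggers a (re)load."""
        self._loader = loader
        self._table = None

    def initialize(self):
        """Load now instead of waiting for the first lookup (optional)."""
        if self._table is None:
            self._table = self._loader()
            self.load_count += 1

    def freq(self, word):
        self.initialize()
        return self._table.get(word, 0)


d = LazyDictionary(lambda: {"北京": 5})
print(d.load_count)    # 0: nothing has been loaded yet
print(d.freq("北京"))  # 5: the first lookup triggers the load
print(d.load_count)    # 1
```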

Segmentation speed

1.5 MB / Second in Full Mode

400 KB / Second in Default Mode

Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt

FAQ

1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7

2) What is the license of this library? https://github.com/fxsjy/jieba/issues/2

For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Change Log

http://www.oschina.net/p/jieba/news#list

Èí¼þÏêÇ飺https://github.com/fxsjy/jieba

ÏÂÔØµØÖ·£ºhttps://github.com/fxsjy/jieba/releases

From: 开源中国社区 (OSChina community)