
Linux Unicode Programming

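Published: 2006-04-20 01:21:01  Source: 红联  Author: 爱零整整
Unicode, a multi-byte character representation system for computers, supports the encoding and exchange of text in all of the world's languages. This article explains why international language support matters in Linux applications and outlines the ideas behind designing Unicode support and incorporating it into them.
Unicode is more than a programming tool; it is also a political and economic one. Applications that do not incorporate support for the world's languages can usually be used only by people who read and write a language that ASCII supports. This puts ASCII-based computer technology out of reach of most of the world's people. Unicode lets a program use any of the world's character sets, and therefore supports all languages.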

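Unicode lets programmers provide software that ordinary people can use in their own language. Nobody has to learn a foreign language first, and the social and financial benefits of computer technology become easier to realize. It is easy to imagine how little computers would be used in the United States if users had to learn Urdu in order to run an Internet browser. The Web would never have appeared.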

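Linux is committed to Unicode to a large degree. Unicode support is built into the kernel and the code development libraries. For the most part, a few simple calls in a program are enough to bring that support into the code automatically.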

The basis of all modern character sets is the American Standard Code for Information Interchange (ASCII), published in 1968 as ANSI X3.4. A notable exception is IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), which was defined before ASCII. ASCII is a coded character set (CCS); in other words, it is a mapping from integers to character representations. ASCII proper is a seven-bit code that defines 128 characters. Stored in an eight-bit byte, with the eighth bit used by the various "extended ASCII" sets, one byte can represent at most 2^8 = 256 characters. This is a severely limited CCS. It cannot represent all the characters of many languages (such as Chinese and Japanese), nor scientific symbols, let alone ancient scripts (runes and hieroglyphics) or musical notation.

Changing the size of a byte so that a larger character set can be encoded might seem to work, but it is completely impractical: all computers are based on eight-bit bytes. The solution is a character encoding scheme (CES), which uses fixed-length or variable-length multi-byte sequences to represent numbers larger than 256. The coded character set then maps these numbers to the characters they stand for.

The definition of Unicode
"Unicode" is usually used as a generic term for a double-byte character encoding scheme. The official name of the Unicode 3.1 CCS is the ISO 10646-1 Universal Multiple Octet Coded Character Set (UCS). Unicode 3.1 added 44,946 new encoded characters. Together with the 49,194 characters that already existed in Unicode 3.0, that makes 94,140.

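The Unicode coded character set uses a four-dimensional coding space made up of 128 three-dimensional groups. Each group contains 256 two-dimensional planes. Each plane consists of 256 one-dimensional rows, and each row has 256 cells. A cell either encodes one character in this coding space or is declared unused. This coding concept is called UCS-4: four octets represent each character, specifying its group, plane, row, and cell.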
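The group/plane/row/cell layout can be sketched in C as a split of one 31-bit value into four octets. The type `ucs4_parts` and the function `ucs4_split` are hypothetical names chosen for this illustration; they are not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the four UCS-4 octets of one character. */
struct ucs4_parts {
    uint8_t group;  /* 0x00..0x7F: one of the 128 groups */
    uint8_t plane;  /* 0x00..0xFF */
    uint8_t row;    /* 0x00..0xFF */
    uint8_t cell;   /* 0x00..0xFF */
};

/* Split a 31-bit UCS-4 code into group, plane, row and cell. */
struct ucs4_parts ucs4_split(uint32_t code)
{
    struct ucs4_parts p;

    assert(code <= 0x7FFFFFFFu);  /* only 128 groups exist */
    p.group = (uint8_t)(code >> 24);
    p.plane = (uint8_t)((code >> 16) & 0xFF);
    p.row   = (uint8_t)((code >> 8) & 0xFF);
    p.cell  = (uint8_t)(code & 0xFF);
    return p;
}
```

For a character such as 0x2260 the group and plane are both 00, so the character lies in the first plane; only the row (0x22) and the cell (0x60) carry information.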

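The first plane (plane 00 of group 00) is the Basic Multilingual Plane (BMP). The BMP defines the characters in general use as alphabetic, syllabic, and ideographic scripts, along with assorted symbols and digits. Subsequent planes are used for additional characters or for coded entities that have not been invented yet. The full range is needed to handle all of the world's languages, in particular some East Asian languages that have nearly 64,000 characters.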

The BMP is used as a double-byte coded character set, identified as the ISO 10646 UCS-2 format. ISO 10646 UCS-2 is what is commonly called Unicode (the two are identical). Like every UCS plane, the BMP contains 256 rows of 256 cells each, and a character is encoded using only the row octet and the cell octet within the BMP. This lets 16-bit coded characters be used to write most of the commercially important languages. UCS-2 requires no code page switching, code extensions, or code states. UCS-2 is a simple way to incorporate Unicode into software, but it is limited to the Unicode BMP.

ÈôÒªÓà 8 λ×Ö½Ú±íʾһ¸ö¶àÓÚ 2^8 =256 ¸ö×Ö·ûµÄ×Ö·û±àÂëϵͳ£¨character coding system£¬CCS£©£¬¾ÍÐèÒªÒ»ÖÖ×Ö·û±àÂë·½°¸(character-encoding scheme£¬CES£©¡£

Unicode transformations
On UNIX, the most widely used character encoding scheme is UTF-8. It allows full support for every page and plane of Unicode, and it still handles ASCII correctly. The alternatives to UTF-8 include UCS-4, UTF-16, UTF-7.5, UTF-7, SCSU, HTML, and Java.

Unicode Transformation Formats (UTFs) are character encoding schemes that support Unicode by mapping its values onto a multi-byte encoding. This article examines the most popular of them, the UTF-8 encoding.

UTF-8
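The UTF-8 transformation format is becoming the dominant method for exchanging international text, because it can support all of the world's languages and is also compatible with ASCII. UTF-8 is a variable-length encoding. Characters from 0 to 0x7F (127) encode to themselves as a single byte, and characters with larger values are encoded as 2 to 6 bytes. (Table 1 shows this original definition; RFC 3629 later restricted UTF-8 to at most four bytes, covering code points up to 0x10FFFF.)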

Table 1. UTF-8 encoding

0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

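A byte of the form 10xxxxxx is a continuation byte; its xxxxxx bit positions are filled with bits of the character's code number in binary. Only the shortest possible multi-byte sequence that can represent a given code is used.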
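As a rough illustration, Table 1 can be turned into a small encoder. This is a minimal sketch that follows the six-row table literally, not a hardened implementation; the name `utf8_encode` is an assumption made for this example. A modern encoder would additionally reject values above 0x10FFFF and the surrogate range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS-4 code (0 .. 0x7FFFFFFF) following Table 1.
 * Writes 1 to 6 bytes into out and returns the byte count.
 * out must have room for at least 6 bytes.
 */
size_t utf8_encode(uint32_t code, unsigned char *out)
{
    /* Upper bound of each row of Table 1 and the matching lead-byte prefix. */
    static const uint32_t limit[6] = {
        0x7Fu, 0x7FFu, 0xFFFFu, 0x1FFFFFu, 0x3FFFFFFu, 0x7FFFFFFFu
    };
    static const unsigned char prefix[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t len = 1;
    size_t i;

    assert(out != NULL);
    assert(code <= 0x7FFFFFFFu);

    /* Pick the shortest sequence that can hold the code. */
    while (code > limit[len - 1])
        len++;

    /* Fill continuation bytes from the end: 10xxxxxx, six bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(prefix[len - 1] | code);
    return len;
}
```

Because the loop always starts from the first row of the table that fits, the function can only produce the shortest form of each code.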

UTF-8 encoding examples
The Unicode copyright sign character 0xA9 = 1010 1001 is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9
The "not equal to" sign character 0x2260 = 0010 0010 0110 0000 is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
The original value can be recovered by taking the payload bits of the lead byte and of each continuation byte:

[1110]0010 [10]001001 [10]100000
0010 001001 100000
0010 0010 0110 0000 = 0x2260
The first byte defines the number of octets that follow; if it is 0x7F or less, it is simply the equivalent ASCII value. Every continuation octet starts with the bits 10 (10xxxxxx), which ensures that it cannot be confused with an ASCII value.
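The bit-stripping shown above can be sketched as a decoder for a single sequence. The function name `utf8_decode` and its return convention are assumptions for this illustration. The sketch deliberately stays simple: it accepts the original 1 to 6 byte forms and does not reject overlong (non-shortest) sequences, which a security-sensitive decoder must do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the UTF-8 sequence starting at s (at most n bytes available).
 * On success stores the code in *code and returns the number of bytes
 * consumed (1 to 6); returns 0 for truncated or malformed input.
 * Simplified sketch: overlong (non-shortest) forms are not rejected.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *code)
{
    size_t len, i;
    uint32_t c;

    assert(s != NULL && code != NULL);
    if (n == 0)
        return 0;

    /* The lead byte tells how many octets belong to the sequence. */
    if (s[0] < 0x80)      { len = 1; c = s[0]; }
    else if (s[0] < 0xC0) { return 0; }  /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; c = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; c = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { len = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xFC) { len = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xFE) { len = 6; c = s[0] & 0x01; }
    else                  { return 0; }  /* 0xFE and 0xFF never occur */

    if (len > n)
        return 0;

    /* Each continuation byte must look like 10xxxxxx and adds six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *code = c;
    return len;
}
```

Feeding it the three bytes 0xE2 0x89 0xA0 from the example yields 0x2260 with a consumed length of 3.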

UTF support
Before using UTF-8 on Linux, make sure the distribution ships glibc 2.2 and XFree86 4.0 or later. Earlier versions lack UTF-8 locale support and ISO 10646-1 X11 fonts.

Before UTF-8, Linux users relied on a variety of language-specific extensions of ASCII: European users had ISO 8859-1 or ISO 8859-2, Greek users ISO 8859-7, and Russian users KOI-8, ISO 8859-5, or CP1251 (Cyrillic). This made data exchange problematic, and application software had to be written to cope with the differences between these encodings. The language support was incomplete, and data exchange was untested. The major Linux distributors and application developers are working to make Unicode, primarily in its UTF-8 form, the standard on Linux.

To identify Unicode files, Microsoft suggests that every Unicode file start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF). It acts as a "signature" or "byte-order mark (BOM)" that identifies the encoding and byte order used in the file. Linux and UNIX, however, do not use a BOM, because it would break the syntax conventions of existing ASCII files. On POSIX systems, the selected locale identifies the encoding expected for all input and output files of a process.
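Since files produced on other platforms may still arrive with a BOM, a Linux program that reads them sometimes has to step over it. The following sketch checks for the UTF-8 form of U+FEFF, which is the byte sequence EF BB BF; the function name `utf8_bom_length` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first byte after a leading UTF-8 BOM
 * (EF BB BF, the UTF-8 form of U+FEFF), or 0 when no BOM is present. */
size_t utf8_bom_length(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller would add the returned offset to the buffer pointer before processing the text.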

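There are two ways to add UTF-8 support to a Linux application. In the first, data is kept in UTF-8 form everywhere, which requires very few software changes (the passive approach). In the second, UTF-8 data that has been read is converted into wide-character arrays with standard C library functions (the converting approach). On output, the wcsrtombs() function converts the strings back to UTF-8:

Listing 1. wcsrtombs()
#include <wchar.h>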
size_t wcsrtombs (char *dest, const wchar_t **src, size_t len, mbstate_t *ps);
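A minimal sketch of how wcsrtombs() might be wrapped follows. The wrapper name `wide_to_multibyte` and its error convention are assumptions for this example. The output encoding is whatever the current LC_CTYPE locale specifies, so the result is UTF-8 only after setlocale() has selected a UTF-8 locale; in the default "C" locale only ASCII characters convert.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert the wide string src into the multibyte encoding of the
 * current LC_CTYPE locale. Returns the number of bytes written
 * (excluding the terminating NUL), or (size_t)-1 when a character
 * cannot be represented or dest is too small for the whole string.
 */
size_t wide_to_multibyte(const wchar_t *src, char *dest, size_t destsize)
{
    mbstate_t state;
    size_t written;

    assert(src != NULL && dest != NULL && destsize > 0);
    memset(&state, 0, sizeof state);  /* start in the initial shift state */

    written = wcsrtombs(dest, &src, destsize, &state);
    if (written == (size_t)-1)
        return (size_t)-1;            /* unrepresentable wide character */
    if (src != NULL)
        return (size_t)-1;            /* ran out of room before the NUL */
    return written;
}
```

wcsrtombs() sets the source pointer to NULL once it has converted the terminating null character, which is how the wrapper tells a complete conversion from a truncated one.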


Which approach to choose depends on the nature of the application. Most applications can work with the passive approach, which is why UTF-8 is so popular on UNIX. Programs such as cat and echo need no modification at all. A byte stream is still just a byte stream, and nothing is done to it. ASCII characters and control codes are unchanged in a UTF-8 locale.

Programs that count characters by counting bytes need small changes. In UTF-8, an application must not count continuation bytes. When a UTF-8 locale is selected, the C library function strlen(s) has to be replaced with mbstowcs():

Listing 2. The mbstowcs() function
#include <stdlib.h>
size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
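Both counting strategies can be sketched side by side. The function names below are hypothetical. The first variant works on raw bytes and assumes well-formed UTF-8 input; the second relies on the glibc and POSIX behavior that mbstowcs() with a NULL destination only counts, and it depends on the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Count characters by skipping UTF-8 continuation bytes (10xxxxxx).
 * Assumes s holds well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Locale-aware variant: with a NULL destination, mbstowcs() only
 * counts the wide characters the string would convert to. It returns
 * (size_t)-1 if s is not valid in the current LC_CTYPE encoding. */
size_t locale_char_count(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

The byte-based version is independent of the locale, which makes it convenient for the passive approach; the mbstowcs() version follows whatever encoding the user has selected.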


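A common use of strlen is to estimate display width. Chinese and other ideographic characters occupy two column positions. The wcwidth() function is used to test the display width of each character:

Listing 3. The wcwidth() function
#include <wchar.h>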
int wcwidth(wchar_t wc);
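As an illustration, the per-character widths can be summed to estimate the width of a whole string. The helper name `display_width` is an assumption; POSIX also provides wcswidth() for the same purpose. On glibc, wcwidth() is only declared when an X/Open feature macro is defined, hence the first line.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> on glibc */
#include <assert.h>
#include <wchar.h>

/* Sum the column widths of a wide string in the current locale.
 * Returns -1 if any character is non-printable there. */
int display_width(const wchar_t *ws)
{
    int total = 0;

    assert(ws != NULL);
    for (; *ws != L'\0'; ws++) {
        int w = wcwidth(*ws);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

The widths reported for non-ASCII characters depend on the LC_CTYPE locale in effect, so setlocale() should be called before relying on the result.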


C support for Unicode
Officially, starting with GNU glibc 2.2, the wchar_t type is intended to hold only 32-bit ISO 10646 values, independent of the locale currently in use. Applications are notified of this through the definition of the __STDC_ISO_10646__ macro, as required by ISO C99. The definition of __STDC_ISO_10646__ indicates that wchar_t is Unicode. The exact value is a decimal constant of the form yyyymmL. For example, use:

Listing 4. Indicating that wchar_t is Unicode
#define __STDC_ISO_10646__ 200104L


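to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.

The following example shows wchar_t in use; it relies on the macro to decide how to write a double quotation mark in portable ISO C99 code.

Listing 5. Deciding how to write a double quotation mark
#if __STDC_ISO_10646__
printf("%lc", (wint_t)0x201c);   /* U+201C LEFT DOUBLE QUOTATION MARK */
#else
putchar('"');
#endif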


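Locales
The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that holds the culture-specific conventions governing software behavior. It covers the character encoding, date and time notation, sorting rules, and the measurement system. A locale name usually consists of an ISO 639-1 language code, an ISO 3166-1 country or region code, and an optional encoding name and other qualifiers. The command locale -a lists all the locales installed on the system (usually found in /usr/lib/locale/).

If a UTF-8 locale is not preinstalled, you can generate it with the localedef command. To generate and activate a German UTF-8 locale for one particular user, use the following statements:

Listing 6. Generating a locale for a specific user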
localedef -v -c -i de_DE -f UTF-8 $HOME/local/locale/de_DE.UTF-8
export LOCPATH=$HOME/local/locale
export LANG=de_DE.UTF-8


Sometimes it is useful to add a UTF-8 locale for all users. The root user can do this with the following command:

Listing 7. Generating a locale for every user
localedef -v -c -i de_DE -f UTF-8 /usr/share/locale/de_DE.UTF-8


To make this locale the default for every user, add the following line to the /etc/profile file:

Listing 8. Setting the default locale for all users
export LANG=de_DE.UTF-8


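The behavior of the functions that handle multi-byte character code sequences depends on the LC_CTYPE category of the current locale, which determines the locale-dependent multi-byte encoding. The value LANG=de_DE (German) causes output to be formatted as ISO 8859-1. The value LANG=de_DE.UTF-8 formats output as UTF-8. The locale setting causes the %ls format specifier of printf to call wcsrtombs() in order to convert its wide-character string argument into the locale-dependent multi-byte encoding. Locales that share a language but carry different country or region identifiers, such as LC_CTYPE=en_GB (British English) and LC_CTYPE=en_AU (Australian English), differ mainly in the LC_MONETARY category, because the name of the currency and the rules for printing monetary amounts differ.

Set the LANG environment variable to your preferred locale. When a C program executes the setlocale() function:

Listing 9. The setlocale() function
#include <stdio.h>
#include <locale.h>

/* char *setlocale(int category, const char *locale); */
int main(void)
{
    /* An empty locale string selects the locale named by the environment. */
    if (!setlocale(LC_CTYPE, ""))
    {
        fprintf(stderr, "Locale not specified. Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }
    return 0;
}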


C ÓïÑԿ⽫»áÒÀ´Î²âÊÔ»·¾³±äÁ¿ LC_ALL¡¢LC_CTYPE ºÍ LANG¡£ÆäÖеÚÒ»¸öº¬ÖµµÄ»·¾³±äÁ¿½«¾ö¶¨Îª LC_CTYPE Àà±ð×°ÈëÄÄÖÖÓïÑÔ»·¾³Êý¾Ý¡£ÓïÑÔ»·¾³Êý¾Ý·ÖÁѳɶÀÁ¢µÄÀà±ð¡£Öµ LC_CTYPE ¶¨ÒåÁË×Ö·û±àÂ룬¶ø LC_COLLATE ¶¨ÒåÁËÅÅÐò˳Ðò¡£ÎÒÃÇÓà LANG »·¾³±äÁ¿ÎªËùÓÐÀà±ðÉèÖÃȱʡÓïÑÔ»·¾³£¬µ« LC_* ±äÁ¿¿ÉÒÔÓÃÀ´¸²¸Çµ¥¸öÀà±ð¡£

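The command locale charmap reports the name of the character encoding of the current locale. If a UTF-8 locale has been selected successfully for the LC_CTYPE category, it prints UTF-8. The command locale -m lists the names of all installed character encodings.

If you use only the C library's multi-byte functions for every conversion between the external character encoding and the wchar_t encoding used internally, the C library takes care of using the right encoding according to LC_CTYPE. The program does not even have to be written with knowledge of what the current multi-byte encoding is.

If an application needs to support UTF-8 (or some other encoding) explicitly, without the libc multi-byte functions, it must determine whether UTF-8 mode should be activated. X/Open-compliant systems that provide the <langinfo.h> header can use code such as the following:

Listing 10. Detecting whether the current locale uses the UTF-8 encoding
#include <langinfo.h>  /* nl_langinfo(), CODESET */
#include <string.h>    /* strcmp() */

/* BOOL, TRUE and FALSE are assumed to be defined by the application. */
BOOL utf8_mode = FALSE;

if (!strcmp(nl_langinfo(CODESET), "UTF-8"))
    utf8_mode = TRUE;


To detect whether the current locale uses the UTF-8 encoding, setlocale(LC_CTYPE, "") must be called first, so that the locale is set according to the environment variables. The nl_langinfo(CODESET) function is also what the locale charmap command calls to look up the name of the encoding specified by the current locale.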
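Putting the two calls together, a self-contained version of the check might look like the sketch below. The function names `codeset_is_utf8` and `init_utf8_mode` are assumptions for this example; plain int is used instead of BOOL so that no application-specific types are required.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Return 1 if the LC_CTYPE locale currently in effect uses UTF-8.
 * The caller is expected to have called setlocale() beforehand. */
int codeset_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0;
}

/* Adopt the locale named by the environment, then report UTF-8 mode.
 * Returns -1 if the requested locale is not installed. */
int init_utf8_mode(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return -1;
    return codeset_is_utf8();
}
```

A program would typically call init_utf8_mode() once at startup and keep the result in a flag such as utf8_mode.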

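Another possible approach is to query the locale environment variables:

Listing 11. Querying the locale environment variables
char *s;
BOOL utf8_mode = FALSE;

/* Use the first of LC_ALL, LC_CTYPE and LANG that is set and non-empty. */
if (((s = getenv("LC_ALL")) && *s) ||
    ((s = getenv("LC_CTYPE")) && *s) ||
    ((s = getenv("LANG")) && *s))
{
    if (strstr(s, "UTF-8"))
        utf8_mode = TRUE;
}


This test assumes that the name of a UTF-8 locale contains the string "UTF-8", which is not always the case, so the nl_langinfo() method should be preferred.

Summary
Supporting all of the world's languages requires a character coding system with more characters than the 2^8 = 256 of eight-bit extended ASCII, together with a character encoding scheme built on eight-bit bytes. Unicode is such a system. It has a four-dimensional coding space of 128 three-dimensional groups, with 94,140 defined character values as of Unicode 3.1, and it is supported by a number of character encoding schemes. On Linux, the most popular of these is the Unicode transformation format UTF-8.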


About the author
TW Burger has been programming, teaching intermediate-level computer courses, and writing books on computer technology since 1979. He runs an information technology consulting company. You can reach him at twburger@bigfoot.com.