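Coded character sets
ASCII: the earliest coded character set. It covers the upper- and lower-case letters A to Z, digits, punctuation and control codes. ASCII is a 7-bit code that defines 128 characters (values 0 to 127), normally stored one per 8-bit byte. The remaining 128 byte values (128 to 255) are not part of ASCII; they were filled in later by various extended character sets as computers spread to other Western countries. (In East Asian usage, the single-byte ASCII characters are the ones called half-width; full-width refers to the two-byte forms of the same letters and symbols that sets such as GB2312 added, not to the 128 to 255 range.)
GB2312 and GBK: when China began using computers to represent Chinese characters, a single byte had no room left for them, so two consecutive bytes that are both greater than 127 were treated as one Chinese character. This scheme is GB2312. When GB2312 proved too small to cover all Chinese characters, the rule was relaxed so that only the first byte of the pair has to be greater than 127; this scheme is GBK. This is why one Chinese character takes as much space as two English characters.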
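The byte rule described above can be checked from Java. The following is a minimal sketch, assuming the running JDK ships the GBK charset (full JDK builds normally do, stripped-down runtimes may not); the class name GbkByteDemo and the helper unsignedBytes are made up for this illustration.

```java
import java.nio.charset.Charset;

class GbkByteDemo {
    // Encodes the text and returns each byte as an unsigned value (0..255),
    // so it can be compared with 127 directly.
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to get 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this runtime.
        if (!Charset.isSupported("GBK")) {
            System.out.println("GBK is not available in this runtime");
            return;
        }
        Charset gbk = Charset.forName("GBK");
        String[] samples = {"A", "\u4E2D"}; // U+4E2D is a common Chinese character
        for (String s : samples) {
            StringBuilder hex = new StringBuilder();
            for (int b : unsignedBytes(s, gbk)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.printf("U+%04X -> %s%n", s.codePointAt(0), hex.toString().trim());
        }
    }
}
```

The letter A stays a single byte below 128, while the Chinese character becomes two bytes that are both above 127.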
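Unicode: as computers spread around the world, many coded character sets appeared. They could not recognize one another, and mixing them in a single document produced garbled text. To bring the existing sets together, one unified character set, Unicode (kept in step with the ISO/IEC 10646 standard), was created. It was originally designed as a 16-bit code, and the encoding form that works in 16-bit units is called UTF-16.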
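Character encoding schemes
Sixteen bits are not enough for Unicode to represent every character: as more scripts are added, more bits are needed. A fixed-width encoding wide enough for everything would pad each ASCII character with many leading zero bits, wasting space. This is why Unicode has several encoding schemes.
UTF-16: Unicode's original encoding scheme. It uses two bytes for each character up to U+FFFF and four bytes (a surrogate pair) for each character beyond that, so ASCII text still takes two bytes per character and the wasted space remains.
UTF-8: a variable-length scheme built on 8-bit units, which suits byte-oriented network transmission. A character takes 1 to 4 bytes, and marker bits in the leading byte indicate how many bytes the character occupies. ASCII characters take one byte each, which solves the wasted-space problem. The cost is that the marker bits use up room, so some scripts need more bytes than they otherwise would; a Chinese character, for example, usually takes three bytes.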
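To make the size trade-off concrete, here is a small sketch that counts the bytes each scheme needs for a few sample characters. The class name EncodedLengthDemo and its two helpers are invented for this example; UTF_16BE is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, ASCII
            "\u00E9",                               // U+00E9, accented Latin letter
            "\u4E2D",                               // U+4E2D, a common Chinese character
            new String(Character.toChars(0x2F81A))  // U+2F81A, beyond U+FFFF
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d byte(s), UTF-16 = %d byte(s)%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8 and 2, 2, 2 and 4 bytes in UTF-16, which shows both the saving for ASCII and the extra byte for Chinese text.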
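The coded character set in Java
I. The java.lang.Character class defines the coded character set that Java uses. Reading the class's Javadoc comment, we learn the following:
Java uses the Unicode coded character set. Code points range from 0x0000 to 0x10FFFF. The range 0x0000 to 0xFFFF is the BMP (Basic Multilingual Plane), and each BMP code point fits in a single 16-bit UTF-16 code unit. Code points above 0xFFFF are supplementary characters, and each one takes two char values (a surrogate pair).
The char type only covers the BMP, that is, a single UTF-16 code unit. A char such as '\uD840' (written '\u005CuD840' in the Javadoc source) is a high surrogate: on its own it is only half of a supplementary character. Methods that accept only a char therefore treat it as an undefined character, so Character.isLetter('\uD840') returns false, even though the same value followed by a low surrogate in a string would represent a letter.
The int type covers supplementary characters as well as the BMP. For example, Character.isLetter(0x2F81A) returns true, because 0x2F81A is the code point of a supplementary character. The original comment reads: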
* <p>A {@code char} value, therefore, represents Basic
* Multilingual Plane (BMP) code points, including the surrogate
* code points, or code units of the UTF-16 encoding. An
* {@code int} value represents all Unicode code points,
* including supplementary code points. The lower (least significant)
* 21 bits of {@code int} are used to represent Unicode code
* points and the upper (most significant) 11 bits must be zero.
* Unless otherwise specified, the behavior with respect to
* supplementary characters and surrogate {@code char} values is
* as follows:
*
* <ul>
* <li>The methods that only accept a {@code char} value cannot support
* supplementary characters. They treat {@code char} values from the
* surrogate ranges as undefined characters. For example,
* {@code Character.isLetter('\u005CuD840')} returns {@code false}, even though
* this specific value if followed by any low-surrogate value in a string
* would represent a letter.
*
* <li>The methods that accept an {@code int} value support all
* Unicode characters, including supplementary characters. For
* example, {@code Character.isLetter(0x2F81A)} returns
* {@code true} because the code point value represents a letter
* (a CJK ideograph).
* </ul>
*
* <p>In the Java SE API documentation, <em>Unicode code point</em> is
* used for character values in the range between U+0000 and U+10FFFF,
* and <em>Unicode code unit</em> is used for 16-bit
* {@code char} values that are code units of the <em>UTF-16</em>
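The two behaviours quoted in the comment can be reproduced directly. This is only an illustrative sketch; the class SurrogateLetterDemo and its wrapper methods are hypothetical names that exist to make the char overload and the int overload of Character.isLetter explicit.

```java
class SurrogateLetterDemo {
    // Calls the overload that takes a single UTF-16 code unit.
    static boolean isLetterAsChar(char unit) {
        return Character.isLetter(unit);
    }

    // Calls the overload that takes a full Unicode code point.
    static boolean isLetterAsCodePoint(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        char high = '\uD840'; // a high surrogate, half of a supplementary character
        System.out.println(isLetterAsChar(high));            // lone surrogate
        System.out.println(isLetterAsCodePoint(0x2F81A));    // supplementary CJK ideograph

        // The same high surrogate followed by a low surrogate forms one code point.
        String pair = "\uD840\uDC00";
        int codePoint = pair.codePointAt(0);
        System.out.printf("U+%X letter? %b%n", codePoint, isLetterAsCodePoint(codePoint));
    }
}
```

The first call returns false and the second returns true, matching the Javadoc. The last line shows why: the lone surrogate only becomes a letter once codePointAt has combined it with the low surrogate that follows it.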
II. The Character class provides methods that tell whether a code point is in the BMP or is a supplementary character:
public static final char MIN_VALUE = '\u0000';
public static final char MAX_VALUE = '\uFFFF';
public static final int MIN_CODE_POINT = 0x000000;
public static final int MAX_CODE_POINT = 0X10FFFF;

public static boolean isBmpCodePoint(int codePoint) {
    return codePoint >>> 16 == 0;
    // Optimized form of:
    //     codePoint >= MIN_VALUE && codePoint <= MAX_VALUE
    // We consistently use logical shift (>>>) to facilitate
    // additional runtime optimizations.
}

public static boolean isValidCodePoint(int codePoint) {
    // Optimized form of:
    //     codePoint >= MIN_CODE_POINT && codePoint <= MAX_CODE_POINT
    int plane = codePoint >>> 16;
    return plane < ((MAX_CODE_POINT + 1) >>> 16);
}
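As a usage sketch for the two methods above, the following classifies a handful of int values. The class CodePointRangeDemo and the method classify are names made up for this example; Character.isBmpCodePoint and Character.isValidCodePoint are the public JDK methods.

```java
class CodePointRangeDemo {
    // Labels an int as "BMP", "supplementary" or "invalid".
    static String classify(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid"; // outside 0x0000..0x10FFFF
        }
        return Character.isBmpCodePoint(codePoint) ? "BMP" : "supplementary";
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xFFFF, 0x10000, 0x2F81A, 0x10FFFF, 0x110000, -1};
        for (int cp : samples) {
            System.out.printf("0x%X -> %s%n", cp, classify(cp));
        }
    }
}
```

Because both methods use the unsigned shift >>>, a negative int ends up with a large "plane" value and is rejected as invalid rather than being mistaken for a BMP code point.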
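It also provides a method that converts a supplementary character from its numeric value into char values:
// This shows that one supplementary character corresponds to two char values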
static void toSurrogates(int codePoint, char[] dst, int index) {
// We write elements "backwards" to guarantee all-or-nothing
dst[index+1] = lowSurrogate(codePoint);
dst[index] = highSurrogate(codePoint);
}
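toSurrogates itself is not public, but the same split can be reproduced by hand and compared with the public Character.toChars. The sketch below assumes the standard UTF-16 rule (subtract 0x10000, then put the upper 10 bits on 0xD800 and the lower 10 bits on 0xDC00); the class SurrogatePairDemo and the method split are hypothetical names.

```java
import java.util.Arrays;

class SurrogatePairDemo {
    // Splits a supplementary code point into {high surrogate, low surrogate}.
    static char[] split(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >>> 10));  // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int codePoint = 0x2F81A;
        char[] manual = split(codePoint);
        char[] library = Character.toChars(codePoint);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
        System.out.println("same result: " + Arrays.equals(manual, library));
        // Combining the pair again gives back the original code point.
        System.out.printf("round trip: 0x%X%n", Character.toCodePoint(manual[0], manual[1]));
    }
}
```

For 0x2F81A both versions produce the pair D87E DC1A, so a String holding this one character has a length() of 2.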
These notes are based on 周華健's online course 《9節課征服「字符編碼」》 (Conquering Character Encoding in 9 Lessons): https://edu.51cto.com/sd/1c7de