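Background
Many years ago, the creators of Unicode naively assumed that fixed-width 16-bit characters could hold every script still in active use on Earth, and the designers of Java fully agreed.
Following the Unicode design, the Java designers decided that a two-byte data type would be enough to represent every Unicode character, which gave us today's primitive type char.
It later turned out that 65,536 characters were nowhere near enough to represent all scripts. Java 5.0 had to support Unicode 4.0 while remaining backward compatible, so it adopted UTF-16 as its internal encoding.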
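UTF-16 Encoding
The Unicode Basic Multilingual Plane (BMP, U+0000 to U+FFFF) covers almost all modern languages along with a large number of special symbols. Java lets a single char represent any character in the BMP, in which case the encoded value equals the Unicode code point. This was Java's original Unicode implementation, and this encoding is also known as UCS-2.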
Enough talk, show me the code!
Let's try printing the up-arrow symbol, which lies in the BMP.
A lookup shows that the code point of the up arrow is 0x2191, so we assign it directly to a char and print it:
char ch = 0x2191;
System.out.println(ch);
Output:
↑
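So how do we represent characters in the supplementary planes (U+10000 to U+10FFFF)?
Unicode reserves two contiguous ranges in the BMP for representing characters in the supplementary planes. This stays compatible with UCS-2 and also limits wasted space, since supplementary characters are needed relatively rarely.
The two ranges are 0xD800–0xDBFF (high surrogates) and 0xDC00–0xDFFF (low surrogates). The encoding works as follows:
- Subtract 0x10000 from the code point, which leaves a 20-bit value;
- Add 0xD800 to the high 10 bits to get the high surrogate;
- Add 0xDC00 to the low 10 bits to get the low surrogate.
Together, the high and low surrogates form a surrogate pair that uniquely identifies any code point in the Unicode supplementary planes.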
Enough talk, show me the code!
Let's try printing the emoji smiley face (U+1F600):
int lowBits = 0x1F600 - 0x10000;
// A char is only 16 bits wide, so a surrogate pair needs two chars, which we wrap in a String
char highSurrogate = (char) ((lowBits >> 10) + 0xD800);
char lowSurrogate = (char) ((lowBits & 0x3FF) + 0xDC00);
System.out.println(new String(new char[]{highSurrogate, lowSurrogate}));
Output:
😀
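As a cross-check of the three encoding steps, the sketch below repeats the manual arithmetic, reverses it, and compares both directions with the JDK's own `Character.toChars` and `Character.toCodePoint`. The class and method names (`SurrogateRoundTrip`, `toSurrogates`, `fromSurrogates`) are made up for this illustration, and the input is assumed to be a valid supplementary code point.

```java
import java.util.Arrays;

class SurrogateRoundTrip {
    // Split a supplementary code point into a surrogate pair by hand.
    // Assumes 0x10000 <= codePoint <= 0x10FFFF (not validated here).
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // at most 20 significant bits
        char high = (char) ((offset >>> 10) + 0xD800);  // top 10 bits
        char low = (char) ((offset & 0x3FF) + 0xDC00);  // bottom 10 bits
        return new char[] {high, low};
    }

    // Reverse the steps: recover the code point from a surrogate pair.
    static int fromSurrogates(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600;
        char[] pair = toSurrogates(codePoint);
        System.out.printf("high=%04X low=%04X%n", (int) pair[0], (int) pair[1]);
        // Both comparisons should agree with the JDK's conversions.
        System.out.println(Arrays.equals(pair, Character.toChars(codePoint)));
        System.out.println(fromSurrogates(pair[0], pair[1])
                == Character.toCodePoint(pair[0], pair[1]));
    }
}
```

For U+1F600 the pair works out to 0xD83D and 0xDE00, the same two chars built by hand in the snippet above.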
Java's Character class provides a rich set of static methods for Unicode operations, for example:
// Convert a surrogate pair to the corresponding Unicode code point
Character.toCodePoint(char high, char low)
// Number of chars needed to represent a code point
Character.charCount(int codePoint)
// Check whether a code point is valid
Character.isValidCodePoint(int codePoint)
// Check whether a char is a high surrogate
Character.isHighSurrogate(char ch)
// Get the high surrogate for a supplementary code point
Character.highSurrogate(int codePoint)
// Check whether a char is a low surrogate
Character.isLowSurrogate(char ch)
// Get the low surrogate for a supplementary code point
Character.lowSurrogate(int codePoint)
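To show how these helpers fit together, here is a small sketch that walks a string one code point at a time instead of one char at a time. `CodePointDemo` and `describe` are hypothetical names used only for this example; `String.codePointAt` and `String.codePointCount` are the standard companions to the Character methods listed above.

```java
class CodePointDemo {
    // List each code point in the string together with the number of
    // chars (UTF-16 code units) it occupies.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // joins a surrogate pair if present
            int n = Character.charCount(cp);  // 1 for BMP, 2 for supplementary
            sb.append(String.format("U+%04X:%d ", cp, n));
            i += n;                           // step over the whole code point
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'A', the up arrow U+2191, and the emoji U+1F600 as a surrogate pair
        String s = "A\u2191\uD83D\uDE00";
        System.out.println(s.length());                       // 4 chars
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points
        System.out.println(describe(s));
        System.out.println(Character.isHighSurrogate(s.charAt(2)));
        System.out.println(Character.isLowSurrogate(s.charAt(3)));
    }
}
```

The gap between `length()` and `codePointCount()` is the practical consequence of surrogate pairs: indexing by char can land in the middle of a character.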
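Converting UTF-16 to UTF-8
Java's String class supports conversion to arbitrary encodings, including UTF-8:
String.getBytes("UTF-8")
The drawback of this method is obvious, though: it cannot reuse an existing buffer, which can be quite inconvenient in some situations. Below is Google's implementation of UTF-8 encoding, for reference: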
public class GoogleUTF8 {
    // Encodes `in` as UTF-8 into `out`, starting at `offset` and using at most
    // `length` bytes. Returns the index just past the last byte written.
    public static int encodeUtf8(CharSequence in, byte[] out, int offset, int length) {
        int utf16Length = in.length();
        int j = offset;
        int i = 0;
        int limit = offset + length;
        // Designed to take advantage of
        // https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
        for (char c; i < utf16Length && i + j < limit && (c = in.charAt(i)) < 0x80; i++) {
            out[j + i] = (byte) c;
        }
        if (i == utf16Length) {
            return j + utf16Length;
        }
        j += i;
        for (char c; i < utf16Length; i++) {
            c = in.charAt(i);
            if (c < 0x80 && j < limit) {
                out[j++] = (byte) c;
            } else if (c < 0x800 && j <= limit - 2) { // 11 bits, two UTF-8 bytes
                out[j++] = (byte) ((0xF << 6) | (c >>> 6));
                out[j++] = (byte) (0x80 | (0x3F & c));
            } else if ((c < Character.MIN_SURROGATE || Character.MAX_SURROGATE < c) && j <= limit - 3) {
                // Maximum single-char code point is 0xFFFF, 16 bits, three UTF-8 bytes
                out[j++] = (byte) ((0xF << 5) | (c >>> 12));
                out[j++] = (byte) (0x80 | (0x3F & (c >>> 6)));
                out[j++] = (byte) (0x80 | (0x3F & c));
            } else if (j <= limit - 4) {
                // Minimum code point represented by a surrogate pair is 0x10000, 17 bits,
                // four UTF-8 bytes
                final char low;
                if (i + 1 == in.length()
                        || !Character.isSurrogatePair(c, (low = in.charAt(++i)))) {
                    throw new UnpairedSurrogateException((i - 1), utf16Length);
                }
                int codePoint = Character.toCodePoint(c, low);
                out[j++] = (byte) ((0xF << 4) | (codePoint >>> 18));
                out[j++] = (byte) (0x80 | (0x3F & (codePoint >>> 12)));
                out[j++] = (byte) (0x80 | (0x3F & (codePoint >>> 6)));
                out[j++] = (byte) (0x80 | (0x3F & codePoint));
            } else {
                // If we are surrogates and we're not a surrogate pair, always throw an
                // UnpairedSurrogateException instead of an ArrayOutOfBoundsException.
                if ((Character.MIN_SURROGATE <= c && c <= Character.MAX_SURROGATE)
                        && (i + 1 == in.length()
                                || !Character.isSurrogatePair(c, in.charAt(i + 1)))) {
                    throw new UnpairedSurrogateException(i, utf16Length);
                }
                throw new ArrayIndexOutOfBoundsException("Failed writing " + c + " at index " + j);
            }
        }
        return j;
    }

    // Not shown in the listing above as originally quoted; a minimal stand-in
    // (an assumption) so that the class compiles on its own.
    static class UnpairedSurrogateException extends IllegalArgumentException {
        UnpairedSurrogateException(int index, int length) {
            super("Unpaired surrogate at index " + index + " of " + length);
        }
    }
}
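For comparison, the JDK itself can also write UTF-8 into a caller-supplied array through `java.nio.charset.CharsetEncoder`. The following is only a sketch of that alternative, not part of Google's code: the class name `ReusableUtf8Encoder` and its `encode` method are invented here, and the parameters merely mirror `encodeUtf8` above for easy comparison.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ReusableUtf8Encoder {
    // One encoder instance, reused across calls (not thread-safe).
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    // Encodes `in` into `out` starting at `offset`, using at most `length` bytes.
    // Returns the index just past the last byte written.
    int encode(CharSequence in, byte[] out, int offset, int length) {
        ByteBuffer target = ByteBuffer.wrap(out, offset, length);
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(in), target, true);
        if (result.isUnderflow()) {
            result = encoder.flush(target);  // all input consumed, finish up
        }
        if (result.isOverflow()) {
            throw new ArrayIndexOutOfBoundsException("Output buffer too small");
        }
        if (result.isError()) {
            // By default the encoder reports malformed input such as an unpaired surrogate.
            throw new IllegalArgumentException("Malformed UTF-16 input: " + result);
        }
        return target.position();
    }

    public static void main(String[] args) {
        ReusableUtf8Encoder enc = new ReusableUtf8Encoder();
        byte[] buffer = new byte[16];  // the same buffer can be reused for later calls
        int end = enc.encode("A\u2191\uD83D\uDE00", buffer, 0, buffer.length);
        System.out.println(end);       // 1 + 3 + 4 = 8 bytes
    }
}
```

This version still wraps the input and output in buffer objects on each call, so whether it is an acceptable substitute depends on how allocation-sensitive the calling code is.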
References
- https://docs.oracle.com/javase/specs/jls/se6/html/lexical.html
- https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html
- https://softwareengineering.stackexchange.com/questions/174947/why-does-java-use-utf-16-for-internal-string-representation
- https://www.oracle.com/technetwork/articles/javase/supplementary-142654.html