We often run into conversions between string, byte slices, and rune. Here is a brief introduction.
A string is essentially a read-only slice of bytes. Indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.
rune is an alias for int32. It represents the Unicode code point of a character and occupies 4 bytes. Converting a string to a rune slice means every character is stored in 4 bytes holding its Unicode value, so iterating over it yields Unicode values rather than bytes.
- A string is an immutable byte sequence.
- A byte slice is a mutable byte sequence.
- A rune slice is a regrouping of a byte slice so that each index is a character.

The built-in declaration reads:

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
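To make the three views concrete, here is a minimal sketch of converting one string to a byte slice and to a rune slice. The helper name `lengths` is made up for this illustration and is not part of any library.

```go
package main

import "fmt"

// lengths is a hypothetical helper: it returns the number of elements
// s has when viewed as a byte slice and as a rune slice.
func lengths(s string) (byteLen, runeLen int) {
	return len([]byte(s)), len([]rune(s))
}

func main() {
	s := "中国话" // three characters, three bytes each in UTF-8

	b := []byte(s) // copy of the raw bytes; mutable
	r := []rune(s) // decoded code points; one element per character

	fmt.Println(len(s), len(b), len(r)) // 9 9 3

	// Indexing differs: s[0] is a single byte, r[0] is a whole character.
	fmt.Printf("%x %c\n", s[0], r[0]) // e4 中

	// Converting back yields a string equal to the original.
	fmt.Println(string(b) == s, string(r) == s) // true true
}
```

Both conversions copy the data, so changing `b` or `r` afterwards does not affect `s`.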
Below we define placeOfInterest as a raw string. It is enclosed in back quotes, so it can contain only literal text.
package main

import "fmt"

func main() {
	// A raw string literal holding the single character U+2318.
	const placeOfInterest = `⌘`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", placeOfInterest)
	fmt.Printf("\n")

	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", placeOfInterest)
	fmt.Printf("\n")

	// An index-based loop visits the string byte by byte.
	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}

	// for range decodes UTF-8 and visits the string rune by rune.
	for _, ch := range placeOfInterest {
		fmt.Printf("\nUnicode character: %c", ch)
	}

	fmt.Printf("\nThe length of placeOfInterest: %d", len(placeOfInterest))
	fmt.Printf("\n")

	const Chinese = "中国话"
	fmt.Println(len(Chinese)) // byte count, not character count
	for index, runeValue := range Chinese {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}
The output is:
plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
Unicode character: ⌘
The length of placeOfInterest: 3
9
U+4E2D '中' starts at byte position 0
U+56FD '国' starts at byte position 3
U+8BDD '话' starts at byte position 6
From the output above we can see:
- The Unicode character value of the symbol ⌘ is U+2318, and it is made up of three bytes: e2 8c 98. These bytes are the UTF-8 encoding of the hexadecimal value 2318.
- When a string is traversed with for range, each value obtained is of type rune, whereas an index-based for loop yields the individual bytes.
- Go uses UTF-8: Go source code is defined to be UTF-8 text, and no other representation is allowed. This means that when we write ⌘ in the code, the editor places the UTF-8 encoding of ⌘ into the source text. So when we print the hex bytes, we are merely dumping the data the editor put in the file.
- The length of a string returned by len is the number of bytes, not the number of characters.
- The Unicode standard uses the term code point for the item represented by a single value. For example, the symbol ⌘ has the hexadecimal value 2318, so its code point is U+2318.
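Because "code point" is a bit of a mouthful, Go introduces a shorter term for the concept: rune. The term appears frequently in the libraries and source code, and it means essentially the same as "code point". In addition, Go defines rune as an alias for int32, which makes it clearer when an integer value represents a code point. As a result, what you might think of as a character constant is called a rune constant in Go. The type of the expression '⌘' is rune and its integer value is 0x2318.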
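The points about rune constants and about len counting bytes can be checked with a short sketch. `utf8.RuneCountInString` comes from the standard `unicode/utf8` package; the helper `counts` is a name invented here for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the byte length of s
// and the number of code points (runes) it contains.
func counts(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	r := '⌘' // rune constant; r has type rune, i.e. int32

	// %T shows the underlying type name, since rune is only an alias.
	fmt.Printf("%T %#x %U\n", r, r, r) // int32 0x2318 U+2318

	b, n := counts(string(r))
	fmt.Println(b, n) // 3 1
}
```

When the character count is needed, `utf8.RuneCountInString(s)` avoids allocating the temporary slice that `len([]rune(s))` would create.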
Note that Unicode is only a character set. It specifies the binary code of each symbol but not how that code should be stored. UTF-8 is the most widely used implementation of Unicode on the internet.
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the byte length varies with the symbol.
The UTF-8 encoding rules are:
- For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
- For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code point.
Figure: UTF-8 encoding format
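As a rough illustration of the n-byte rule for n = 3, the sketch below packs the code point U+2318 into the pattern 1110xxxx 10xxxxxx 10xxxxxx by hand and compares the result with the standard library's `utf8.EncodeRune`. The function `encode3` is a hypothetical helper written for this example. It assumes its input lies in the range U+0800 to U+FFFF and performs no validation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 is a hypothetical helper that hand-encodes a code point in the
// range U+0800..U+FFFF using the three-byte UTF-8 pattern
// 1110xxxx 10xxxxxx 10xxxxxx. It does not validate its input.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),         // 1110 prefix + top 4 bits
		0x80 | (byte(r>>6) & 0x3F), // 10 prefix + middle 6 bits
		0x80 | (byte(r) & 0x3F),    // 10 prefix + low 6 bits
	}
}

func main() {
	manual := encode3('⌘')
	for _, b := range manual {
		fmt.Printf("%08b ", b) // 11100010 10001100 10011000
	}
	fmt.Println()
	fmt.Printf("% x\n", manual[:]) // e2 8c 98

	// The standard library produces the same bytes.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '⌘')
	fmt.Printf("% x\n", buf[:n]) // e2 8c 98
}
```

In real code `utf8.EncodeRune` or a plain `string(r)` conversion is the better choice, since both handle every encoded length and invalid input.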
Summary
- Go source code is always UTF-8.
- A string holds arbitrary bytes.
- A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes. To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
- Those sequences represent Unicode code points, called runes.
- No guarantee is made in Go that characters in strings are normalized.
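The difference between "a string holds arbitrary bytes" and "string literals hold UTF-8" can be sketched as follows. `utf8.ValidString` and `utf8.RuneError` are from the standard `unicode/utf8` package; `describe` is a helper name made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports whether s is valid UTF-8
// and how many times ranging over s yields the replacement rune U+FFFD.
func describe(s string) (valid bool, replaced int) {
	for _, r := range s {
		if r == utf8.RuneError {
			replaced++
		}
	}
	return utf8.ValidString(s), replaced
}

func main() {
	literal := "⌘"            // plain literal: valid UTF-8
	escaped := "\xe2\x8c\x98" // byte-level escapes spelling the same bytes
	broken := "\xe2\x8c"      // truncated sequence: not valid UTF-8

	fmt.Println(literal == escaped) // true
	fmt.Println(describe(literal))  // true 0
	fmt.Println(describe(broken))   // false 2
}
```

Ranging over the truncated string does not fail. Each undecodable byte is reported as U+FFFD, and the loop advances one byte at a time.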
A string is a nice way to deal with a short sequence of bytes or characters. Every time you operate on a string, such as with find-and-replace or by taking a substring, a new string value is created. Repeatedly modifying a huge string, such as file content, can therefore be very inefficient.
A byte slice is just like a string, but mutable: you can modify each byte in place. This is very efficient for working with file content, whether a text file, a binary file, or an I/O stream from the network.
A rune slice is like a byte slice, except that each index is a character instead of a byte. This is best if you work with text that has many non-ASCII characters, such as Chinese text, math symbols like ∑, or emoji.
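A small sketch of when each mutable form is convenient, assuming ASCII-only edits for the byte slice and per-character edits for the rune slice. Both helper functions, `upperASCII` and `replaceAt`, are invented for this illustration.

```go
package main

import "fmt"

// upperASCII is a hypothetical helper that upper-cases ASCII letters
// in place. Bytes outside 'a'..'z' are left untouched.
func upperASCII(b []byte) {
	for i, c := range b {
		if 'a' <= c && c <= 'z' {
			b[i] = c - ('a' - 'A')
		}
	}
}

// replaceAt is a hypothetical helper that returns s with the character
// at character index i replaced by r. It panics if i is out of range.
func replaceAt(s string, i int, r rune) string {
	rs := []rune(s)
	rs[i] = r
	return string(rs)
}

func main() {
	b := []byte("go strings")
	upperASCII(b) // modifies b directly, no new allocation
	fmt.Println(string(b)) // GO STRINGS

	// Index 2 addresses the third character, not the third byte.
	fmt.Println(replaceAt("中国话", 2, '人')) // 中国人
}
```

The rune slice costs 4 bytes per character plus a conversion in each direction, so it pays off mainly when characters need to be addressed by position.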