Characters and Grapheme Clusters
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string. NSString has a large inventory of methods for properly handling Unicode strings, which in general make Unicode compliance easy, but there are a few precautions you should observe.
NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.
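To make the UTF-16 counting concrete, here is a minimal sketch in Python rather than Objective-C. It does not call NSString; the helper name `utf16_length` is made up for illustration. It counts 16-bit units in the way NSString lengths are described above, so the result can differ from the number of characters a user sees.

```python
def utf16_length(s: str) -> int:
    """Return the number of 16-bit UTF-16 units needed to represent s."""
    # Encode without a byte-order mark; every UTF-16 unit occupies two bytes.
    return len(s.encode("utf-16-le")) // 2


samples = {
    "plain ASCII letter": "a",
    "precomposed e-acute": "\u00e9",
    "e + combining acute": "e\u0301",
    "emoji outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(f"{label}: {utf16_length(text)} UTF-16 unit(s)")
```

The last two samples each look like one character on screen but occupy two UTF-16 units, which is why index arithmetic on individual units is fragile.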
The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider.
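The surrogate-pair mapping itself is plain arithmetic defined by UTF-16. The following Python sketch shows that arithmetic; it is not the CFString API, and the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

A substring range that ends between `high` and `low` would leave each side holding half of a code point, which is the situation the constraint above is meant to prevent.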
In many writing systems, a single character may be composed of a base letter plus an accent or other decoration. The number of possible letters and accents precludes Unicode from representing each combination as a single code point, so in general such combinations are represented by a base character followed by one or more combining marks. For compatibility reasons, Unicode does have single code points for a number of the most common combinations; these are referred to as precomposed forms, and Unicode normalization transformations can be used to convert between precomposed and decomposed representations. However, even if a string is fully precomposed, there are still many combinations that must be represented using a base character and combining marks. For most text processing, substring ranges should be arranged so that their boundaries do not separate a base character from its associated combining marks.
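As an illustration of precomposed and decomposed forms, the sketch below uses Python's standard `unicodedata` module rather than NSString. NFC and NFD are the canonical composed and decomposed Unicode normalization forms. The choice of "q" plus an acute accent as a combination with no precomposed code point is my own example, not one from the text above.

```python
import unicodedata

precomposed = "\u00e9"   # é stored as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Normalization converts between the two representations.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# Some combinations have no precomposed form, so even a fully
# composed (NFC) string still contains base + combining mark.
still_two = unicodedata.normalize("NFC", "q\u0301")
print(len(still_two))
```

Because the last string keeps its combining mark even after composition, normalizing to a precomposed form does not remove the need to keep base characters and marks together.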
In addition, there are writing systems in which characters represent a combination of parts that are more complicated than accent marks. In Korean, for example, a single Hangul syllable can be composed of two or three subparts known as jamo. In the Indic and Indic-influenced writing systems common throughout South and Southeast Asia, single written characters often represent combinations of consonants, vowels, and marks such as viramas, and the Unicode representations of these writing systems often use code points for these individual parts, so that a single character may be composed of multiple code points. For most text processing, substring ranges should also be arranged so that their boundaries do not separate the jamo in a single Hangul syllable, or the components of an Indic consonant cluster.
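The Hangul case can be seen by decomposing a syllable into its jamo. This is a small Python sketch using the standard `unicodedata` module; the helper name `jamo_of` is hypothetical.

```python
import unicodedata


def jamo_of(syllable: str) -> list:
    """Return the conjoining jamo code points of a Hangul syllable."""
    # Canonical decomposition (NFD) splits a precomposed syllable into jamo.
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", syllable)]


print(jamo_of("\ud55c"))   # 한: three jamo
print(jamo_of("\uac00"))   # 가: two jamo
```

In decomposed form a single user-perceived syllable spans two or three code points, so a range boundary placed between them would split the syllable.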
In general, these combinations (surrogate pairs, base characters plus combining marks, Hangul jamo, and Indic consonant clusters) are referred to as grapheme clusters. In order to take them into account, you can use NSString's rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex: methods, or CFStringGetRangeOfComposedCharactersAtIndex. These can be used to adjust string indexes or substring ranges so that they fall on grapheme cluster boundaries, taking into account all of the constraints mentioned above. These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.
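To show what "adjusting an index to a cluster boundary" means, here is a deliberately simplified Python sketch. It is not the NSString implementation and it does not implement the full grapheme cluster rules: it handles only surrogate pairs and combining marks in the Basic Multilingual Plane, and ignores Hangul jamo and Indic clusters. All function names are hypothetical.

```python
import unicodedata


def utf16_units(s: str) -> list:
    """Return s as a list of 16-bit UTF-16 unit values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def is_combining(u: int) -> bool:
    # Simplification: only combining marks encoded as a single unit.
    return not (is_high(u) or is_low(u)) and unicodedata.combining(chr(u)) != 0


def composed_range_at(units: list, index: int) -> tuple:
    """Return (location, length) of the simplified cluster containing index."""
    start = index
    # Walk back over combining marks and from a low surrogate to its high half.
    while start > 0 and (
        is_combining(units[start])
        or (is_low(units[start]) and is_high(units[start - 1]))
    ):
        start -= 1
    # Step over the base: two units for a surrogate pair, otherwise one.
    end = start + 1
    if is_high(units[start]) and end < len(units) and is_low(units[end]):
        end += 1
    # Absorb any combining marks that follow the base.
    while end < len(units) and is_combining(units[end]):
        end += 1
    return start, end - start


units = utf16_units("e\u0301\U0001F600x")
print(composed_range_at(units, 1))
print(composed_range_at(units, 3))
```

Here an index pointing at the combining accent or at the second half of the emoji is widened to cover the whole cluster. In real Cocoa code the methods named above are the right tool, since they account for cases this sketch leaves out.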
In some cases, Unicode algorithms deal with multiple characters in ways that go beyond even grapheme cluster boundaries. Unicode casing algorithms may convert a single character into multiple characters when going from lowercase to uppercase; for example, the standard uppercase equivalent of the German character “ß” is the two-letter sequence “SS”. Localized collation algorithms in many languages consider multiple-character sequences as single units; for example, the sequence “ch” is treated as a single letter for sorting purposes in some European languages. In order to deal properly with cases like these, it is important to use standard NSString methods for such operations as casing, sorting, and searching, and to use them on the entire string to which they are to apply. Use NSString methods such as lowercaseString, uppercaseString, capitalizedString, compare: and its variants, rangeOfString: and its variants, and rangeOfCharacterFromSet: and its variants, or their CFString equivalents. These all take into account the complexities of Unicode string processing, and the searching and sorting methods in particular have many options to control the types of equivalences they are to recognize.
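The length-changing behaviour of casing is easy to observe. The sketch below uses Python's built-in string methods as a stand-in for the NSString casing methods; Python's `upper()` applies Unicode full case mappings but is not locale-aware, so it illustrates the principle only. The helper `canonically_equal` is hypothetical.

```python
import unicodedata

word = "stra\u00dfe"          # "straße"
upper = word.upper()          # whole-string uppercase conversion

print(upper)
print(len(word), len(upper))  # the uppercase form is one character longer


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring precomposed/decomposed differences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print("\u00e9" == "e\u0301")                   # raw comparison of code points
print(canonically_equal("\u00e9", "e\u0301"))  # comparison after normalization
```

Any code that assumes uppercase output has the same length as its input, or that compares strings unit by unit, gets cases like these wrong. That is the reason to apply the library's casing, comparison, and search methods to the entire string.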
In some less common cases, it may be necessary to tailor the definition of grapheme clusters to a particular need. The issues involved in determining and tailoring grapheme cluster boundaries are covered in detail in Unicode Standard Annex #29, which gives a number of examples and some algorithms. The Unicode standard in general is the best source for information about Unicode algorithms and the considerations involved in processing Unicode strings.
If you are interested in grapheme cluster boundaries from the point of view of cursor movement and insertion point positioning, and you are using the Cocoa text system, you should know that on OS X v10.5 and later, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an “fi” ligature in Latin script, may require an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.