09. Characters and Grapheme Clusters

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string. NSString has a large inventory of methods for properly handling Unicode strings, which in general make Unicode compliance easy, but there are a few precautions you should observe.

NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.
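To make the distinction concrete, the following minimal sketch counts the same text three ways: in UTF-16 units (what NSString calls "characters"), in Unicode code points, and in user-perceived characters. The helper name `unitCounts` is invented for this illustration and is not part of any API; the sketch assumes Swift with Foundation.

```swift
import Foundation

// Count one piece of text three ways:
//   utf16     - NSString "characters", that is, UTF-16 units
//   scalars   - Unicode code points
//   graphemes - user-perceived characters (Swift Character values)
// `unitCounts` is a hypothetical helper written for this example.
func unitCounts(_ text: String) -> (utf16: Int, scalars: Int, graphemes: Int) {
    let ns = NSString(string: text)
    return (ns.length, text.unicodeScalars.count, text.count)
}

let samples = [
    "abc",          // three ASCII letters
    "e\u{301}",     // "e" followed by COMBINING ACUTE ACCENT
    "\u{1F600}"     // an emoji outside the Basic Multilingual Plane
]

for sample in samples {
    let c = unitCounts(sample)
    print("\(sample): utf16=\(c.utf16) scalars=\(c.scalars) graphemes=\(c.graphemes)")
}
```

Only the first sample gives the same answer for all three counts, which is why NSString lengths and indexes should not be read as counts of visible characters.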

The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider.
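The surrogate pair mapping is simple arithmetic, sketched below in plain Swift. This sketch does not call the CFString conversion functions mentioned above; it performs the calculation by hand so the reserved ranges are visible. All three function names are hypothetical helpers invented for this illustration.

```swift
// Split a supplementary code point (U+10000 through U+10FFFF) into its
// UTF-16 surrogate pair. Returns nil for code points that fit in one unit.
func surrogatePair(for codePoint: UInt32) -> (high: UInt16, low: UInt16)? {
    guard (0x10000...0x10FFFF).contains(codePoint) else { return nil }
    let offset = codePoint - 0x10000              // a 20-bit value
    let high = UInt16(0xD800 + (offset >> 10))    // top 10 bits
    let low = UInt16(0xDC00 + (offset & 0x3FF))   // bottom 10 bits
    return (high, low)
}

// Recombine a surrogate pair into the code point it represents.
func codePoint(high: UInt16, low: UInt16) -> UInt32? {
    guard (0xD800...0xDBFF).contains(high),
          (0xDC00...0xDFFF).contains(low) else { return nil }
    return 0x10000 + ((UInt32(high) - 0xD800) << 10) + (UInt32(low) - 0xDC00)
}

// True when a substring boundary at `index` would fall between the two
// halves of a surrogate pair.
func splitsSurrogatePair(_ units: [UInt16], at index: Int) -> Bool {
    guard index > 0, index < units.count else { return false }
    return (0xD800...0xDBFF).contains(units[index - 1])
        && (0xDC00...0xDFFF).contains(units[index])
}

let units = Array("a\u{1F600}".utf16)             // [0x0061, 0xD83D, 0xDE00]
print(splitsSurrogatePair(units, at: 1))          // boundary before the pair
print(splitsSurrogatePair(units, at: 2))          // boundary inside the pair
```

A check like `splitsSurrogatePair` covers only this one constraint; it says nothing about combining marks or the other cases described in the paragraphs that follow.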

In many writing systems, a single character may be composed of a base letter plus an accent or other decoration. The number of possible letters and accents precludes Unicode from representing each combination as a single code point, so in general such combinations are represented by a base character followed by one or more combining marks. For compatibility reasons, Unicode does have single code points for a number of the most common combinations; these are referred to as precomposed forms, and Unicode normalization transformations can be used to convert between precomposed and decomposed representations. However, even if a string is fully precomposed, there are still many combinations that must be represented using a base character and combining marks. For most text processing, substring ranges should be arranged so that their boundaries do not separate a base character from its associated combining marks.
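As a rough illustration of precomposed and decomposed forms, the sketch below converts between them using Foundation's canonical-mapping string properties. The helper `scalarValues` is invented for this example. The last case uses "q" with an acute accent as a combination that has no precomposed code point.

```swift
import Foundation

// List the code points of a string (hypothetical helper for this example).
func scalarValues(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

let decomposed = "e\u{301}"    // "e" + COMBINING ACUTE ACCENT
let precomposed = "\u{E9}"     // LATIN SMALL LETTER E WITH ACUTE

// Canonical composition and decomposition via Foundation.
let composedForm = decomposed.precomposedStringWithCanonicalMapping
let decomposedForm = precomposed.decomposedStringWithCanonicalMapping

print(scalarValues(composedForm))      // one code point
print(scalarValues(decomposedForm))    // base letter plus combining mark

// No precomposed code point exists for q + acute, so even the fully
// precomposed form still contains a base character and a combining mark.
let noPrecomposedForm = "q\u{301}".precomposedStringWithCanonicalMapping
print(scalarValues(noPrecomposedForm))
```

The third result is the case the paragraph warns about: normalizing to precomposed form does not remove the need to keep base characters and combining marks together.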

In addition, there are writing systems in which characters represent a combination of parts that are more complicated than accent marks. In Korean, for example, a single Hangul syllable can be composed of two or three subparts known as jamo. In the Indic and Indic-influenced writing systems common throughout South and Southeast Asia, single written characters often represent combinations of consonants, vowels, and marks such as viramas, and the Unicode representations of these writing systems often use code points for these individual parts, so that a single character may be composed of multiple code points. For most text processing, substring ranges should also be arranged so that their boundaries do not separate the jamo in a single Hangul syllable, or the components of an Indic consonant cluster.
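The Hangul case can be sketched with one syllable written as three conjoining jamo. The arithmetic in `composeHangul` follows the Hangul syllable composition formula from the Unicode Standard; the function itself is a hypothetical helper written for this example, and it assumes its arguments are valid conjoining jamo code points.

```swift
import Foundation

let jamo = "\u{1112}\u{1161}\u{11AB}"   // lead HIEUH + vowel A + tail NIEUN
let syllable = "\u{D55C}"               // the same syllable, precomposed

// Compose conjoining jamo into a precomposed syllable code point.
// Assumes lead is in U+1100...U+1112, vowel in U+1161...U+1175,
// and tail (if present) in U+11A8...U+11C2.
func composeHangul(lead: UInt32, vowel: UInt32, tail: UInt32?) -> UInt32 {
    let l = lead - 0x1100
    let v = vowel - 0x1161
    let t = tail.map { $0 - 0x11A7 } ?? 0   // 0 means "no tail consonant"
    return 0xAC00 + (l * 21 + v) * 28 + t
}

print(jamo.utf16.count)   // three UTF-16 units in the string
print(jamo.count)         // one user-perceived character
print(jamo == syllable)   // Swift compares by canonical equivalence
```

A substring range that ended after the first or second UTF-16 unit of `jamo` would cut the syllable apart, which is the situation the paragraph above says to avoid.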

In general, these combinations (surrogate pairs, base characters plus combining marks, Hangul jamo, and Indic consonant clusters) are referred to as grapheme clusters. In order to take them into account, you can use NSString’s rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex: methods, or CFStringGetRangeOfComposedCharactersAtIndex. These can be used to adjust string indexes or substring ranges so that they fall on grapheme cluster boundaries, taking into account all of the constraints mentioned above. These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.
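The following sketch shows one way these two NSString methods might be used from Swift, where they are spelled `rangeOfComposedCharacterSequences(for:)` and `rangeOfComposedCharacterSequence(at:)`. The helper names `snappedRange` and `composedSequences` are invented for this illustration.

```swift
import Foundation

// Widen a UTF-16 range so that neither end cuts through a grapheme cluster.
func snappedRange(_ range: NSRange, in text: NSString) -> NSRange {
    return text.rangeOfComposedCharacterSequences(for: range)
}

// Walk a string one composed character sequence at a time.
func composedSequences(in text: NSString) -> [String] {
    var pieces: [String] = []
    var index = 0
    while index < text.length {
        let r = text.rangeOfComposedCharacterSequence(at: index)
        pieces.append(text.substring(with: r))
        index = r.location + r.length    // jump to the next cluster boundary
    }
    return pieces
}

// UTF-16 layout: a(0) e(1) U+0301(2) high surrogate(3) low surrogate(4) b(5)
let text = NSString(string: "ae\u{301}\u{1F600}b")

// A hand-built range that starts on the combining mark and ends between
// the two surrogates.
let careless = NSRange(location: 2, length: 2)
let safe = snappedRange(careless, in: text)
print(safe.location, safe.length)

print(composedSequences(in: text).count)   // clusters, not UTF-16 units
```

Snapping always widens the range, so a caller that needs an exact length (for truncation, for example) would have to decide separately whether to include or drop the cluster at each end.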

In some cases, Unicode algorithms deal with multiple characters in ways that go beyond even grapheme cluster boundaries. Unicode casing algorithms may convert a single character into multiple characters when going from lowercase to uppercase; for example, the standard uppercase equivalent of the German character “ß” is the two-letter sequence “SS”. Localized collation algorithms in many languages consider multiple-character sequences as single units; for example, the sequence “ch” is treated as a single letter for sorting purposes in some European languages. In order to deal properly with cases like these, it is important to use standard NSString methods for such operations as casing, sorting, and searching, and to use them on the entire string to which they are to apply. Use NSString methods such as lowercaseString, uppercaseString, capitalizedString, compare: and its variants, rangeOfString: and its variants, and rangeOfCharacterFromSet: and its variants, or their CFString equivalents. These all take into account the complexities of Unicode string processing, and the searching and sorting methods in particular have many options to control the types of equivalences they are to recognize.
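A brief sketch of whole-string casing, comparison, and searching follows, using the Swift spellings of the methods named above. The sample strings and the particular option combinations are chosen only for illustration; localized collation (such as the “ch” case) depends on the locale passed to the comparison and is not shown here.

```swift
import Foundation

// Casing is applied to the whole string, and the result may be longer
// than the input: the single letter "ß" uppercases to "SS".
let lower = "straße"
let upper = lower.uppercased()
print(lower.count, upper.count)

// Comparison options control which equivalences are recognized.
let sameIgnoringCaseAndAccents = "résumé".compare(
    "RESUME",
    options: [.caseInsensitive, .diacriticInsensitive]
) == .orderedSame
print(sameIgnoringCaseAndAccents)

// Searching accepts the same kind of options.
let match = "Café au lait".range(
    of: "cafe",
    options: [.caseInsensitive, .diacriticInsensitive]
)
print(match != nil)
```

Converting case one UTF-16 unit at a time could not produce the “SS” result, because the output is longer than the input; that is one reason to apply these operations to the entire string.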

In some less common cases, it may be necessary to tailor the definition of grapheme clusters to a particular need. The issues involved in determining and tailoring grapheme cluster boundaries are covered in detail in Unicode Standard Annex #29, which gives a number of examples and some algorithms. The Unicode standard in general is the best source for information about Unicode algorithms and the considerations involved in processing Unicode strings.

If you are interested in grapheme cluster boundaries from the point of view of cursor movement and insertion point positioning, and you are using the Cocoa text system, you should know that on OS X v10.5 and later, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an “fi” ligature in Latin script, may require an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.
