09. Characters and Grapheme Clusters

Related link:
https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html#//apple_ref/doc/uid/TP40008025-SW1

Characters and Grapheme Clusters

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string. NSString has a large inventory of methods for properly handling Unicode strings, which in general make Unicode compliance easy, but there are a few precautions you should observe.

NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.
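
The difference between UTF-16 units and user-perceived characters can be seen in a small sketch. The sample string and variable names below are illustrative assumptions, not part of the original guide; the block only relies on NSString's length and characterAtIndex: (spelled `character(at:)` in Swift).

```swift
import Foundation

// "e" + U+0301 COMBINING ACUTE ACCENT, followed by U+1F600 (outside the BMP).
let text = "e\u{301}\u{1F600}"
let ns = NSString(string: text)

// NSString counts UTF-16 units: 1 (e) + 1 (accent) + 2 (surrogate pair).
let unitCount = ns.length
// Swift's String counts user-perceived characters instead.
let characterCount = text.count
// character(at:) returns a single UTF-16 unit, here the bare "e" without its accent.
let firstUnit = ns.character(at: 0)

print(unitCount, characterCount, String(firstUnit, radix: 16))
```

The user sees two characters, but the NSString length is four, which is why index arithmetic on individual "characters" is fragile.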

The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider.
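
As a rough illustration of surrogate pairs, the sketch below uses the Swift standard library's UTF16 helpers rather than the CFString conversion functions the text mentions. The helper name `splitsSurrogatePair` is hypothetical and only shows one way to check a hand-built index.

```swift
import Foundation

let scalar: Unicode.Scalar = "\u{1F600}"   // a code point outside the BMP

// The two UTF-16 units that together encode the code point.
let lead = UTF16.leadSurrogate(scalar)     // from the range D800...DBFF
let trail = UTF16.trailSurrogate(scalar)   // from the range DC00...DFFF

// Reassemble the UTF-32 value from the two halves.
let rebuilt = 0x10000 + ((UInt32(lead) - 0xD800) << 10) + (UInt32(trail) - 0xDC00)

/// Hypothetical helper: true if `index` falls between the halves of a surrogate pair.
func splitsSurrogatePair(_ s: NSString, at index: Int) -> Bool {
    guard index > 0, index < s.length else { return false }
    return UTF16.isLeadSurrogate(s.character(at: index - 1))
        && UTF16.isTrailSurrogate(s.character(at: index))
}

let ns = NSString(string: "a\u{1F600}b")   // units: a, lead, trail, b
print(String(lead, radix: 16), String(trail, radix: 16), String(rebuilt, radix: 16))
print(splitsSurrogatePair(ns, at: 1), splitsSurrogatePair(ns, at: 2))
```

A substring boundary at index 2 of this string would cut the pair in half, so a range constructed by hand should move such an index to 1 or 3.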

In many writing systems, a single character may be composed of a base letter plus an accent or other decoration. The number of possible letters and accents precludes Unicode from representing each combination as a single code point, so in general such combinations are represented by a base character followed by one or more combining marks. For compatibility reasons, Unicode does have single code points for a number of the most common combinations; these are referred to as precomposed forms, and Unicode normalization transformations can be used to convert between precomposed and decomposed representations. However, even if a string is fully precomposed, there are still many combinations that must be represented using a base character and combining marks. For most text processing, substring ranges should be arranged so that their boundaries do not separate a base character from its associated combining marks.
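
A minimal sketch of the normalization step, using Foundation's precomposedStringWithCanonicalMapping and decomposedStringWithCanonicalMapping. The sample strings are assumptions chosen for illustration; "x" with an acute accent is used as a combination that has no precomposed code point.

```swift
import Foundation

let decomposed = "e\u{301}"    // "e" + COMBINING ACUTE ACCENT (two code points)
let precomposed = "\u{E9}"     // LATIN SMALL LETTER E WITH ACUTE (one code point)

// Convert between the two representations.
let composedForm = decomposed.precomposedStringWithCanonicalMapping
let decomposedForm = precomposed.decomposedStringWithCanonicalMapping

// A combination with no precomposed code point keeps its combining mark
// even after precomposition.
let stillCombined = "x\u{301}".precomposedStringWithCanonicalMapping

print(composedForm.unicodeScalars.count)     // one code point
print(decomposedForm.unicodeScalars.count)   // two code points
print(stillCombined.unicodeScalars.count)    // two code points
```

The last case is the reason precomposing a string does not remove the need to respect combining-mark boundaries.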

In addition, there are writing systems in which characters represent a combination of parts that are more complicated than accent marks. In Korean, for example, a single Hangul syllable can be composed of two or three subparts known as jamo. In the Indic and Indic-influenced writing systems common throughout South and Southeast Asia, single written characters often represent combinations of consonants, vowels, and marks such as viramas, and the Unicode representations of these writing systems often use code points for these individual parts, so that a single character may be composed of multiple code points. For most text processing, substring ranges should also be arranged so that their boundaries do not separate the jamo in a single Hangul syllable, or the components of an Indic consonant cluster.
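
The Hangul case can be sketched with a syllable written as three conjoining jamo. The chosen syllable and the variable names are illustrative assumptions; the calls are Foundation's rangeOfComposedCharacterSequenceAtIndex: and precomposedStringWithCanonicalMapping.

```swift
import Foundation

// One Hangul syllable spelled as three conjoining jamo:
// U+1112 (leading consonant), U+1161 (vowel), U+11AB (trailing consonant).
let jamo = "\u{1112}\u{1161}\u{11AB}"
let ns = NSString(string: jamo)

// Three UTF-16 units, but a single user-perceived character.
let unitCount = ns.length
let characterCount = jamo.count

// Asking about the middle unit returns the range of the whole syllable.
let cluster = ns.rangeOfComposedCharacterSequence(at: 1)

// Canonical composition folds the jamo into one precomposed syllable code point.
let syllable = jamo.precomposedStringWithCanonicalMapping

print(unitCount, characterCount, cluster.location, cluster.length)
print(String(syllable.unicodeScalars.first!.value, radix: 16))
```

A substring range ending at index 1 or 2 of this string would split the syllable, which is the situation the paragraph above warns against.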

In general, these combinations (surrogate pairs, base characters plus combining marks, Hangul jamo, and Indic consonant clusters) are referred to as grapheme clusters. In order to take them into account, you can use NSString’s rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex: methods, or CFStringGetRangeOfComposedCharactersAtIndex. These can be used to adjust string indexes or substring ranges so that they fall on grapheme cluster boundaries, taking into account all of the constraints mentioned above. These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.
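
A short sketch of both NSString methods follows. The helper `composedSequences(of:)` and the sample string are hypothetical, introduced only to show how an index can be walked from one cluster boundary to the next and how a hand-built range can be widened.

```swift
import Foundation

/// Hypothetical helper: splits a string into its composed character sequences.
func composedSequences(of string: String) -> [String] {
    let ns = NSString(string: string)
    var pieces: [String] = []
    var index = 0
    while index < ns.length {
        // Range of the whole cluster that contains the unit at `index`.
        let range = ns.rangeOfComposedCharacterSequence(at: index)
        pieces.append(ns.substring(with: range))
        index = NSMaxRange(range)   // jump to the next cluster boundary
    }
    return pieces
}

// UTF-16 units: e, U+0301, lead surrogate, trail surrogate, z
let text = "e\u{301}\u{1F600}z"
let ns = NSString(string: text)

// A hand-built range that starts on the combining accent and ends
// in the middle of the surrogate pair.
let naive = NSRange(location: 1, length: 2)
// Widened so that both ends fall on cluster boundaries.
let adjusted = ns.rangeOfComposedCharacterSequences(for: naive)

let pieces = composedSequences(of: text)
print(pieces.count, adjusted.location, adjusted.length)
```

Widening the range, rather than shrinking it, keeps every cluster that the original range touched intact.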

In some cases, Unicode algorithms deal with multiple characters in ways that go beyond even grapheme cluster boundaries. Unicode casing algorithms may convert a single character into multiple characters when going from lowercase to uppercase; for example, the standard uppercase equivalent of the German character “ß” is the two-letter sequence “SS”. Localized collation algorithms in many languages consider multiple-character sequences as single units; for example, the sequence “ch” is treated as a single letter for sorting purposes in some European languages. In order to deal properly with cases like these, it is important to use standard NSString methods for such operations as casing, sorting, and searching, and to use them on the entire string to which they are to apply. Use NSString methods such as lowercaseString, uppercaseString, capitalizedString, compare: and its variants, rangeOfString: and its variants, and rangeOfCharacterFromSet: and its variants, or their CFString equivalents. These all take into account the complexities of Unicode string processing, and the searching and sorting methods in particular have many options to control the types of equivalences they are to recognize.
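
The sketch below applies the casing, searching, and sorting methods to whole strings, using the Swift spellings of uppercaseString, rangeOfString:options:, and compare:options:range:locale:. The sample words and the Czech locale identifier are assumptions for illustration, and the result of the localized comparison depends on the collation data available on the system, so it is printed rather than stated.

```swift
import Foundation

// Casing: one character can become two.
let word = "straße"
let upper = word.uppercased()

// Searching: options control which equivalences are recognized.
let found = "Résumé".range(of: "resume",
                           options: [.caseInsensitive, .diacriticInsensitive]) != nil

// Sorting: case-insensitive comparison of whole strings.
let order = "a".compare("B", options: .caseInsensitive)

// Localized collation: in Czech, "ch" is treated as a single letter that
// sorts after "h". The outcome depends on the installed locale data.
let czech = Locale(identifier: "cs")
let plain = "chata".compare("hrad")
let localized = "chata".compare("hrad", options: [], range: nil, locale: czech)

print(upper, word.count, upper.count)
print(found, order == .orderedAscending)
print(plain.rawValue, localized.rawValue)
```

Uppercasing "straße" one character at a time with a per-character table could not produce the longer result, which is why the whole-string methods are preferred.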

In some less common cases, it may be necessary to tailor the definition of grapheme clusters to a particular need. The issues involved in determining and tailoring grapheme cluster boundaries are covered in detail in Unicode Standard Annex #29, which gives a number of examples and some algorithms. The Unicode standard in general is the best source for information about Unicode algorithms and the considerations involved in processing Unicode strings.

If you are interested in grapheme cluster boundaries from the point of view of cursor movement and insertion point positioning, and you are using the Cocoa text system, you should know that on OS X v10.5 and later, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an “fi” ligature in Latin script, may require an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.
