Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component.
WORKING ON SPACY'S SOURCE
To add a new language to spaCy, you'll need to modify the library's code. The easiest way to do this is to clone the repository (https://github.com/explosion/spaCy) and build spaCy from source. For more information on this, see the installation guide. Unlike spaCy's core, which is mostly written in Cython, all language data is stored in regular Python files. This means that you won't have to rebuild anything in between – you can simply make edits and reload spaCy to test them.
Obviously, there are lots of ways you can organise your code when you implement your own language data. This guide will focus on how it's done within spaCy. For full language support, you'll need to create a Language subclass, define custom language data, like a stop list and tokenizer exceptions and test the new tokenizer. Once the language is set up, you can build the vocabulary, including word frequencies, Brown clusters and word vectors. Finally, you can train the tagger and parser, and save the model to a directory.
For some languages, you may also want to develop a solution for lemmatization and morphological analysis.
Language data
Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organised in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be generalised across languages – for example, rules for basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”. This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German.
from spacy.lang.en import English
from spacy.lang.de import German
nlp_en = English() # includes English data
nlp_de = German() # includes German data
Stop words stop_words.py
List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop.
Tokenizer exceptions tokenizer_exceptions.py
Special-case rules for the tokenizer, for example contractions like "can't" and abbreviations with punctuation, like "U.K.".
Norm exceptions norm_exceptions.py
Special-case rules for normalising tokens to improve the model's predictions, for example on American vs. British spelling.
Punctuation rules punctuation.py
Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.
Character classes char_classes.py
Character classes to be used in regular expressions, for example Latin characters, quotes, hyphens or icons.
Lexical attributes lex_attrs.py
Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred".
Syntax iterators syntax_iterators.py
Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks.
Lemmatizer lemmatizer.py
Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".
Tag map tag_map.py
Dictionary mapping strings in your tag set to Universal Dependencies tags.
Morph rules morph_rules.py
Exception rules for morphological analysis of irregular words like personal pronouns.
The individual components expose variables that can be imported within a language module, and added to the language's Defaults. Some components, like the punctuation rules, usually don't need much customisation and can simply be imported from the global rules. Others, like the tokenizer and norm exceptions, are very specific and will make a big difference to spaCy's performance on the particular language and training a language model.
SHOULD I EVER UPDATE THE GLOBAL DATA?
Reusable language data is collected as atomic pieces in the root of the spacy.lang package. Often, when a new language is added, you'll find a pattern or symbol that's missing. Even if it isn't common in other languages, it might be best to add it to the shared language data, unless it has some conflicting interpretation. For instance, we don't expect to see guillemot quotation symbols (? and ?) in English text. But if we do see them, we'd probably prefer the tokenizer to split them off.
FOR LANGUAGES WITH NON-LATIN CHARACTERS
In order for the tokenizer to split suffixes, prefixes and infixes, spaCy needs to know the language's character set. If the language you're adding uses non-latin characters, you might need to add the required character classes to the global char_classes.py. spaCy uses the regex library to keep this simple and readable. If the language requires very specific punctuation rules, you should consider overwriting the default regular expressions with your own in the language's Defaults. For Chinese, for example, the full-width punctuation characters need to be defined.
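As a rough illustration, adding a CJK range to the shared character classes might look something like this – the variable names here are hypothetical, not spaCy's actual ones:
# illustrative only: a character range for common CJK ideographs that could be
# added to char_classes.py so the tokenizer's regexes can match them
_cjk = r'\u4E00-\u9FFF'
ALPHA = '[A-Za-z' + _cjk + ']'  # merged into an alphabetic class (hypothetical name)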
The Language subclass
Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a directory spacy/lang/es, which can be imported as spacy.lang.es.
Code and resources specific to Chinese, likewise, would go into spacy/lang/zh and be imported as spacy.lang.zh.
To get started, you can use our templates for the most important files. Here's what the class template looks like:
__INIT__.PY (EXCERPT)
# import language-specific data
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
# create Defaults class in the module scope (necessary for pickling!)
class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code
    # optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)
    # merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
# create actual Language class
class Xxxxx(Language):
    lang = 'xx' # language ISO code
    Defaults = XxxxxDefaults # override defaults
# set default export – this allows the language class to be lazy-loaded
__all__ = ['Xxxxx']
WHY LAZY-LOADING?
Some languages contain large volumes of custom data, like lemmatizer lookup tables, or complex regular expressions that are expensive to compute. As of spaCy v2.0, Language classes are not imported on initialisation and are only loaded when you import them directly, or load a model that requires a language to be loaded. To lazy-load languages in your application, you can use the util.get_lang_class() helper function with the two-letter language code as its argument.
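For example, a minimal usage sketch, assuming the placeholder ISO code 'xx' from the template above:
from spacy.util import get_lang_class

lang_cls = get_lang_class('xx')  # imports and returns the Language subclass for 'xx'
nlp = lang_cls()                 # instantiate it, just like English() or German()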
Stop words
A "stop list" is a classic trick from the early days of information retrieval when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model. To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string.
WHAT DOES SPACY CONSIDER A STOP WORD?
There's no particularly principled logic behind what words should be added to the stop list. Make a list that you think might be useful to people and is likely to be unsurprising. As a rule of thumb, words that are very rare are unlikely to be useful stop words.
For Chinese, the mature stop word lists from Fudan University or the Harbin Institute of Technology are a reasonable starting point. What matters is how the list is defined and used – see the example below:
EXAMPLE
STOP_WORDS = set(""" a about above across after afterwards again against all almost alone along already also although always am among amongst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond both bottom but by """.split())
The words inside the triple-quoted string are the stop words – for Chinese, you would drop a Chinese stop word list in there instead.
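A minimal sketch of what a Chinese spacy/lang/zh/stop_words.py could look like – the handful of words below are only an illustration, a real list would come from a curated source such as the ones mentioned above:
STOP_WORDS = set("""
的 了 是 在 和 有 我 也 就 不 都 这 那 吗 吧 啊
""".split())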
IMPORTANT NOTE
When adding stop words from an online source, always include the link in a comment. Make sure to proofread and double-check the words carefully. A lot of the lists available online have been passed around for years and often contain mistakes, like unicode errors or random words that have once been added for a specific use case, but don't actually qualify.
Tokenizer exceptions
spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens, covered by your special case, such as the subtokens' LEMMA or TAG.
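For instance, a single exception entry could look roughly like this, using "don't" purely as an illustration:
from ...symbols import ORTH, LEMMA

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not"}]}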
IMPORTANT NOTE
If an exception consists of more than one token, the ORTH values combined always need to match the original string. The way the original string is split up can be pretty arbitrary sometimes – for example, "gonna" is split into "gon" (lemma "go") and "na" (lemma "to"). Because of how the tokenizer works, it's currently not possible to split single-letter strings into multiple tokens.
Unambiguous abbreviations, like month names or locations in English, should be added to exceptions with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}. Since the exceptions are added in Python, you can use custom logic to generate them more efficiently and make your data less verbose. How you do this ultimately depends on the language – for a Chinese adaptation, cases like "Jan." simply don't arise. Here's an example of how exceptions for time formats like "1a.m." and "1am" are generated in the English tokenizer_exceptions.py:
# use short, internal variable for readability
_exc = {}

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = _exc
GENERATING TOKENIZER EXCEPTIONS
Keep in mind that generating exceptions only makes sense if there's a clearly defined and finite number of them, like common contractions in English. This is not always the case – in Spanish for instance, infinitive or imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In cases like this, spaCy shouldn't be generating exceptions for all verbs. Instead, this will be handled at a later stage during lemmatization.
When adding the tokenizer exceptions to the Defaults, you can use the update_exc() helper function to merge them with the global base exceptions (including one-letter abbreviations and emoticons). The function performs a basic check to make sure exceptions are provided in the correct format. It can take any number of exceptions dicts as its arguments, and will update and overwrite the exceptions in this order. For example, if your language's tokenizer exceptions include a custom tokenization pattern for "a.", it will overwrite the base exceptions with the language's custom one.
EXAMPLE
from ...util import update_exc
BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
ABOUT SPACY'S CUSTOM PRONOUN LEMMA
Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it" – or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.
Norm exceptions
In addition to ORTH or LEMMA, tokenizer exceptions can also set a NORM attribute. This is useful to specify a normalised version of the token – for example, the norm of "n't" is "not". By default, a token's norm equals its lowercase text. If the lowercase spelling of a word exists, norms should always be in lowercase. (For Chinese, about the only case-like distinction is between ordinary and formal numerals, which may or may not be worth treating this way.)
NORMS VS. LEMMAS
doc = nlp(u"I'm gonna realise")
norms = [token.norm_ for token in doc]
lemmas = [token.lemma_ for token in doc]
assert norms == ['i', 'am', 'going', 'to', 'realize']
assert lemmas == ['i', 'be', 'go', 'to', 'realise']
spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".
Similarly, spaCy also includes global base norms for normalising different styles of quotation marks and currency symbols (see the shared norm_exceptions.py: https://github.com/explosion/spaCy/blob/master/spacy/lang/norm_exceptions.py). Even though $ and € are very different, spaCy normalises them both to $. This way, they'll always be seen as similar, no matter how common they were in the training data.
Norm exceptions can be provided as a simple dictionary. For more examples, see the English norm_exceptions.py.
EXAMPLE
NORM_EXCEPTIONS = {
    "cos": "because",
    "fav": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized"
}
To add the custom norm exceptions lookup table, you can use the add_lookups() helper function. It takes the default attribute getter function as its first argument, plus a variable list of dictionaries. If a string's norm is found in one of the dictionaries, that value is used – otherwise, the default function is called and the token is assigned its default norm.
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                     NORM_EXCEPTIONS, BASE_NORMS)
The order of the dictionaries is also the lookup order – so if your language's norm exceptions overwrite any of the global exceptions, they should be added first. Also note that the tokenizer exceptions will always have priority over the attribute getters.
Lexical attributes
spaCy provides a range of Token attributes that return useful information on that token – for example, whether it's uppercase or lowercase, a left or right punctuation mark, or whether it resembles a number or email address. Most of these functions, like is_lower or like_url, should be language-independent. Others, like like_num (which includes both digits and number words), require some customisation.
BEST PRACTICES
English number words are pretty simple, because even large numbers consist of individual tokens, and we can get away with splitting and matching strings against a list. In other languages, like German, "two hundred and thirty-four" is one word, and thus one token. Here, it's best to match a string against a list of number word fragments (instead of a technically almost infinite list of possible number words).
Here's the English definition as an example:
LEX_ATTRS.PY
_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
              'gajillion', 'bazillion']

def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False

LEX_ATTRS = {
    LIKE_NUM: like_num
}
By updating the default lexical attributes with a custom LEX_ATTRS dictionary in the language's defaults via lex_attr_getters.update(LEX_ATTRS), only the new custom functions are overwritten.
Syntax iterators
Syntax iterators are functions that compute views of a Doc object based on its syntax. At the moment, this data is only used for extracting noun chunks, which are available as the Doc.noun_chunks property. Because base noun phrases work differently across languages, the rules to compute them are part of the individual language's data. If a language does not include a noun chunks iterator, the property won't be available. For examples, see the existing syntax iterators:
NOUN CHUNKS EXAMPLE
doc = nlp(u'A phrase with another phrase occurs.')
chunks = list(doc.noun_chunks)
assert chunks[0].text == "A phrase"
assert chunks[1].text == "another phrase"
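As a rough sketch, a language's syntax_iterators.py defines a generator that yields (start, end, label) tuples and exposes it via a SYNTAX_ITERATORS dict. The dependency labels below are a simplified, illustrative subset of what the English rules actually use:
from ...symbols import NOUN, PROPN, PRON

def noun_chunks(obj):
    doc = obj.doc                          # obj can be a Doc or a Span
    np_label = doc.vocab.strings.add('NP')
    np_deps = ['nsubj', 'dobj', 'pobj']    # illustrative subset of labels
    for word in obj:
        if word.pos in (NOUN, PROPN, PRON) and word.dep_ in np_deps:
            yield word.left_edge.i, word.i + 1, np_label

SYNTAX_ITERATORS = {'noun_chunks': noun_chunks}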
Lemmatizer
As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token's lemma, spaCy simply looks it up in the table. Here's an example from the Spanish language data:
LANG/ES/LEMMATIZER.PY (EXCERPT)
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
}
To provide a lookup lemmatizer for your language, import the lookup table and add it to the Language class as lemma_lookup:
lemma_lookup = dict(LOOKUP)
Tag map
Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.
The keys of the tag map should be strings in your tag set. The values should be a dictionary. The dictionary must have an entry POS whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map as well. This allows you to do simple rule-based morphological analysis.
EXAMPLE
from ..symbols import POS, NOUN, VERB, DET
TAG_MAP = {
    "NNS":  {POS: NOUN, "Number": "plur"},
    "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT":   {POS: DET}
}
Morph rules
The morphology rules let you set token attributes such as lemmas, keyed by the extended part-of-speech tag and token text. The morphological features and their possible values are language-specific and based on the Universal Dependencies scheme.
EXAMPLE
from ..symbols import LEMMA
MORPH_RULES = {
    "VBZ": {
        "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
        "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
    }
}
In the example above, the entry for "am" under the tag "VBZ" assigns the lemma "be" together with the features VerbForm=Fin, Person=One, Tense=Pres and Mood=Ind.
IMPORTANT NOTE
The morphological attributes are currently not all used by spaCy. Full integration is still being developed. In the meantime, it can still be useful to add them, especially if the language you're adding includes important distinctions and special cases. This ensures that as soon as full support is introduced, your language will be able to assign all possible attributes.
Testing the language
Before using the new language or submitting a pull request to spaCy, you should make sure it works as expected. This is especially important if you've added custom regular expressions for token matching or punctuation – you don't want to be causing regressions.
SPACY'S TEST SUITE
spaCy uses the pytest framework for testing. For more details on how the tests are structured and best practices for writing your own tests, see our tests documentation (https://github.com/explosion/spaCy/blob/master/spacy/tests).
The easiest way to test your new tokenizer is to run the language-independent "tokenizer sanity" tests located in tests/tokenizer. This will test for basic behaviours like punctuation splitting, URL matching and correct handling of whitespace. In the conftest.py, add the new language ID to the list of _languages:
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'it', 'nb',
              'nl', 'pl', 'pt', 'sv', 'xx'] # new language here
GLOBAL TOKENIZER TEST EXAMPLE
# use fixture by adding it as an argument
def test_with_all_languages(tokenizer):
    # will be performed on ALL language tokenizers
    tokens = tokenizer(u'Some text here.')
The language will now be included in the tokenizer test fixture, which is used by the basic tokenizer tests. If you want to add your own tests that should be run over all languages, you can use this fixture as an argument of your test function.
Writing language-specific tests
It's recommended to always add at least some tests with examples specific to the language. Language tests should be located in tests/lang in a directory named after the language ID. You'll also need to create a fixture for your tokenizer in the conftest.py. Always use the get_lang_class() helper function within the fixture, instead of importing the class at the top of the file. This will load the language data only when it's needed. (Otherwise, all data would be loaded every time you run a test.)
@pytest.fixture
def en_tokenizer():
    return util.get_lang_class('en').Defaults.create_tokenizer()
When adding test cases, always parametrize them – this will make it easier for others to add more test cases without having to modify the test itself. You can also add parameter tuples, for example, a test sentence and its expected length, or a list of expected tokens. Here's an example of an English tokenizer test for combinations of punctuation and abbreviations:
EXAMPLE TEST
@pytest.mark.parametrize('text,length', [
    ("The U.S. Army likes Shock and Awe.", 8),
    ("U.N. regulations are not a part of their concern.", 10),
    ("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length
Training a language model
spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train word vectors. To generate the word frequencies from a large, raw corpus, you can use the word_freqs.py script from the spaCy developer resources.
Note that your corpus should not be preprocessed (i.e. you need punctuation for example). The word frequencies should be generated as a tab-separated file with three columns:
1. The number of times the word occurred in your language sample.
2. The number of distinct documents the word occurred in.
3. The word itself.
ES_WORD_FREQS.TXT
6361109   111 Aunque
23598543  111 aunque
10097056  111 claro
193454    111 aro
7711123   111 viene
12812323  111 mal
23414636  111 momento
2014580   111 felicidad
233865    111 repleto
15527     111 eto
235565    111 deliciosos
17259079  111 buena
71155     111 Anímate
37705     111 anímate
33155     111 cuéntanos
2389171   111 cuál
961576    111 típico
BROWN CLUSTERS
Additionally, you can use distributional similarity features provided by the Brown clustering algorithm. You should train a model with between 500 and 1000 clusters. A minimum frequency threshold of 10 usually works well.
You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we'd have incorrect frequency counts for the tokens "ca" and "n't".
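A rough sketch of the idea – this is not the official word_freqs.py script, and it assumes a hypothetical corpus.txt with one document per line:
from collections import Counter
from spacy.util import get_lang_class

nlp = get_lang_class('en')()   # or your new language's ISO code
word_counts = Counter()
doc_counts = Counter()

for line in open('corpus.txt', encoding='utf8'):
    tokens = [t.text for t in nlp.make_doc(line)]   # tokenize only, no pipeline
    word_counts.update(tokens)
    doc_counts.update(set(tokens))

with open('word_freqs.txt', 'w', encoding='utf8') as f:
    for word, freq in word_counts.items():
        f.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))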
Training the word vectors
Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves – they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match. You can use our word vectors training script (https://github.com/explosion/spacy-dev-resources/blob/master/training/word_vectors.py), which pre-processes the text with your language-specific tokenizer and trains the model using Gensim (https://radimrehurek.com/gensim/). The vectors.bin file should consist of one word and vector per line.
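At its core, that script does something along these lines – a minimal sketch against the older Gensim API (the size parameter is called vector_size in Gensim 4.x), with a toy corpus standing in for sentences pre-tokenized by your spaCy tokenizer:
from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['a', 'lazy', 'dog']]
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)
# text format: one word and its vector per line, as described above
model.wv.save_word2vec_format('vectors.bin', binary=False)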
Training the tagger and parser
You can now train the model using a corpus for your language annotated with Universal Dependencies. If your corpus uses the CoNLL-U format, i.e. files with the extension .conllu, you can use the convert command to convert it to spaCy's JSON format for training. Once you have your UD corpus transformed into JSON, you can train your model using spaCy's train command.
For more details and examples of how to train the tagger and dependency parser, see the usage guide on training.