Training spaCy’s Statistical Models
This guide describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser. Once the model is trained, you can then save and load it.
Training basics
spaCy's models are statistical and every "decision" they make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction. This prediction is based on the examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.
The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function that calculates the difference between the training example and the expected output. The greater the difference, the more significant the gradient and the updates to our model.
Training data: Examples and their annotations.
Text: The input text the model should predict a label for.
Label: The label the model should predict.
Gradient: Gradient of the loss function calculating the difference between input and expected output.
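To make the idea of an error gradient concrete, here is a minimal sketch in plain Python. It is not spaCy's internal implementation: it assumes a softmax over three hypothetical labels and a cross-entropy loss, for which the gradient with respect to the scores is simply the predicted probabilities minus the one-hot correct answer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def error_gradient(scores, correct_index):
    """Gradient of the cross-entropy loss w.r.t. the scores: probs - one_hot."""
    probs = softmax(scores)
    return [p - (1.0 if i == correct_index else 0.0) for i, p in enumerate(probs)]

def magnitude(gradient):
    """L2 norm of the gradient, i.e. how strongly the model gets corrected."""
    return math.sqrt(sum(g * g for g in gradient))

# Hypothetical labels and scores, for illustration only.
labels = ["NOUN", "VERB", "ADJ"]
confident_and_right = error_gradient([5.0, 0.0, 0.0], correct_index=0)
confident_and_wrong = error_gradient([0.0, 5.0, 0.0], correct_index=0)
```

A prediction that is confidently wrong produces a much larger gradient than one that is confidently right, which is the "greater difference, more significant update" behaviour described above.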
When training a model, we don't just want it to memorise our examples – we want it to come up with a theory that can be generalised across other examples. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts like this, is most likely a company. That's why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.
This also means that in order to know how the model is performing, and whether it's learning the right things, you don't only need training data – you'll also need evaluation data. If you only test the model with the data it was trained on, you'll have no idea how well it's generalising. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation. To update an existing model, you can already achieve decent results with very few examples – as long as they're representative.
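As a rough illustration of holding back evaluation data, the sketch below shuffles a list of examples and splits off a fraction that the model never trains on. The function name `split_examples` and the 20% default are assumptions made for this example, not part of spaCy.

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold back a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    # Everything after the held-back slice is used for training.
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical placeholder examples.
examples = [("text %d" % i, {"label": i % 2}) for i in range(10)]
train_examples, eval_examples = split_examples(examples)
```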
How do I get training data?
Collecting training data may sound incredibly painful – and it can be, if you're planning a large-scale annotation project. However, if your main goal is to update an existing model's predictions – for example, spaCy's named entity recognition – the hard part is usually not creating the actual annotations. It's finding representative examples and extracting potential candidates. The good news is, if you've been noticing bad performance on your data, you likely already have some relevant text, and you can use spaCy to bootstrap a first set of training examples. For example, after processing a few sentences, you may end up with the following entities, some correct, some incorrect.
HOW MANY EXAMPLES DO I NEED?
As a rule of thumb, you should allocate at least 10% of your project resources to creating training and evaluation data. If you're looking to improve an existing model, you might be able to start off with only a handful of examples. Keep in mind that you'll always want a lot more than that for evaluation – especially previous errors the model has made. Otherwise, you won't be able to sufficiently verify that the model has actually made the correct generalisations required for your use case.
Alternatively, the rule-based matcher can be a useful tool to extract tokens or combinations of tokens, as well as their start and end index in a document. In this case, we'll extract mentions of Google and assume they're an ORG.
Based on the few examples above, you can already create six training sentences with eight entities in total. Of course, what you consider a "correct annotation" will always depend on what you want the model to learn. While there are some entity annotations that are more or less universally correct – like Canada being a geopolitical entity – your application may have its very own definition of the NER annotation scheme.
Example:
# Each entity is (start_char, end_char, label); end_char is exclusive.
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
    ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
    ("Spotify steps up Asia expansion", [(0, 7, 'ORG'), (17, 21, 'LOC')]),
    ("Google Maps launches location sharing", [(0, 11, 'PRODUCT')]),
    ("Google rebrands its business apps", [(0, 6, 'ORG')]),
    ("look what i found on google! ??", [(21, 27, 'PRODUCT')]),
]
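Character offsets are easy to get wrong by one, so it can help to check them before training. The following is a minimal sketch in plain Python; `check_offsets` is a hypothetical helper written for this example, and the second call uses a deliberately wrong offset that swallows a trailing space.

```python
def check_offsets(text, entities):
    """Return (start, end, label, problem) tuples for suspicious annotations."""
    problems = []
    for start, end, label in entities:
        span = text[start:end]
        if not (0 <= start < end <= len(text)):
            problems.append((start, end, label, "out of range"))
        elif span != span.strip():
            problems.append((start, end, label, "leading or trailing whitespace"))
    return problems

good = check_offsets("Google rebrands its business apps", [(0, 6, "ORG")])
# (0, 8) covers "Spotify " including the space, so it is reported.
bad = check_offsets("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")])
```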
TIP: TRY THE PRODIGY ANNOTATION TOOL
If you need to label a lot of data, check out Prodigy, a new, active learning-powered annotation tool we've developed. Prodigy is fast and extensible, and comes with a modern web application that helps you collect training data faster. It integrates seamlessly with spaCy, pre-selects the most relevant examples for annotation, and lets you train and evaluate ready-to-use spaCy models.
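Translator's note: this is Explosion AI promoting its own annotation tool, Prodigy (https://prodi.gy/). It is a genuinely impressive tool, but it is a paid product built by Explosion, so buy it only if you really need it. The personal edition costs $349 and the commercial edition $449.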
Training with annotations
The GoldParse object collects the annotated training examples, also called the gold standard. It's initialised with the Doc object it refers to, and keyword arguments specifying the annotations, like tags or entities. Its job is to encode the annotations, keep them aligned and create the C-level data structures required for efficient access. Here's an example of a simple GoldParse for part-of-speech tags:
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
Using the Doc and its gold-standard annotations, the model can be updated to learn a sentence of three words with their assigned part-of-speech tags. The tag map is part of the vocabulary and defines the annotation scheme. If you're training a new language model, this will let you map the tags present in the treebank you train on to spaCy's tag scheme.
doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])
The same goes for named entities. The letters added before the labels refer to the tags of the BILUO scheme – O is a token outside an entity, U a single-token entity unit, B the beginning of an entity, I a token inside an entity and L the last token of an entity.
The BILUO scheme:
TAG      DESCRIPTION
BEGIN    The first token of a multi-token entity.
IN       An inner token of a multi-token entity.
LAST     The final token of a multi-token entity.
UNIT     A single-token entity.
OUT      A non-entity token.
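To show how the scheme in the table maps onto character offsets, here is a small sketch in plain Python. It assumes simple whitespace tokenization, which is cruder than spaCy's tokenizer, and `biluo_from_offsets` is a hypothetical name chosen for this example.

```python
def biluo_from_offsets(text, entities):
    """Tag whitespace-separated tokens with BILUO labels from character offsets."""
    # Record the (start, end) character positions of each whitespace token.
    tokens, position = [], 0
    for word in text.split():
        start = text.index(word, position)
        tokens.append((start, start + len(word)))
        position = start + len(word)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of tokens lying entirely inside the entity span.
        inside = [i for i, (s, e) in enumerate(tokens) if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return tags

tags = biluo_from_offsets("Android Pay expands to Canada",
                          [(0, 11, "PRODUCT"), (23, 29, "GPE")])
```

Here the two-token entity "Android Pay" receives a B and an L tag, while the single-token entity "Canada" receives a U tag.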
WHY BILUO, NOT IOB?
There are several coding schemes for encoding entity annotations as token tags. These coding schemes are equally expressive, but not necessarily equally learnable. Ratinov and Roth showed that the minimal Begin, In, Out scheme was more difficult to learn than the BILUO scheme that we use, which explicitly marks boundary tokens.
Training data: The training examples.
Text and label: The current example.
Doc: A Doc object created from the example text.
GoldParse: A GoldParse object of the Doc and label.
nlp: The nlp object with the model.
Optimizer: A function that holds state between updates.
Update: Update the model's weights.
Of course, it's not enough to only show a model a single example once. Especially if you only have few examples, you'll want to train for a number of iterations. At each iteration, the training data is shuffled to ensure the model doesn't make any generalisations based on the order of examples. Another technique to improve the learning results is to set a dropout rate, a rate at which to randomly "drop" individual features and representations. This makes it harder for the model to memorise the training data. For example, a 0.25 dropout means that each feature or internal representation has a 1/4 likelihood of being dropped.
begin_training() : Start the training and return an optimizer function to update the model's weights. Can take an optional function converting the training data to spaCy's training format.
update() : Update the model with the training example and gold data.
to_disk() : Save the updated model to a directory.
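The dropout rate mentioned above can be illustrated with a few lines of plain Python. This is only a sketch of the general idea, not how spaCy applies dropout internally; in particular, rescaling the surviving features ("inverted dropout") is an assumption of this example.

```python
import random

def apply_dropout(features, drop, rng):
    """Zero each feature with probability `drop` and rescale the survivors."""
    if drop <= 0.0:
        return list(features)
    keep = 1.0 - drop
    return [f / keep if rng.random() >= drop else 0.0 for f in features]

rng = random.Random(0)
features = [1.0] * 10000
dropped = apply_dropout(features, drop=0.25, rng=rng)
# With a 0.25 dropout rate, roughly a quarter of the features are zeroed.
fraction_dropped = dropped.count(0.0) / len(dropped)
```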
EXAMPLE TRAINING LOOP
import random
from spacy.gold import GoldParse

# Assumes `nlp`, `train_data` and the optional `get_data` function already exist.
optimizer = nlp.begin_training(get_data)
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk('/model')
The nlp.update function takes the following arguments:
docs: Doc objects. The update method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts.
golds: GoldParse objects. The update method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations.
drop: Dropout rate. Makes it harder for the model to just memorise the data.
sgd: An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use.
Instead of writing your own training loop, you can also use the built-in train command, which expects data in spaCy's JSON format. On each epoch, a model will be saved out to the directory. After training, you can use the package command to generate an installable Python package from your model.
python -m spacy convert /tmp/train.conllu /tmp/data
python -m spacy train en /tmp/model /tmp/data/train.json -n 5
Simple training style
Instead of sequences of Doc and GoldParse objects, you can also use the "simple training style" and pass raw texts and dictionaries of annotations to nlp.update. The dictionaries can have the keys entities, heads, deps, tags and cats. This is generally recommended, as it removes one layer of abstraction, and avoids unnecessary imports. It also makes it easier to structure and load your training data.
EXAMPLE ANNOTATIONS
{
    'entities': [(0, 4, 'ORG')],
    'heads': [1, 1, 1, 5, 5, 2, 7, 5],
    'deps': ['nsubj', 'ROOT', 'prt', 'quantmod', 'compound', 'pobj', 'det', 'npadvmod'],
    'tags': ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN'],
    'cats': {'BUSINESS': 1.0}
}
SIMPLE TRAINING LOOP
import random
import spacy

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]}),
]
# Note: a blank pipeline needs the relevant component (e.g. the entity
# recognizer) added before it can learn from these annotations.
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
The above training loop leaves out a few details that can really improve accuracy – but the principle really is that simple. Once you've got your pipeline together and you want to tune the accuracy, you usually want to process your training examples in batches, and experiment with minibatch sizes and dropout rates, set via the drop keyword argument. See the Language and Pipe API docs for available options.
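To illustrate what processing examples in batches of growing size can look like, here is a sketch in plain Python. The two generators are simplified stand-ins written for this example and are not spaCy's own helpers; the start size, maximum and growth factor are arbitrary assumptions.

```python
def compounding(start, stop, compound):
    """Yield start, start * compound, ... capped at stop, indefinitely."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def minibatch(items, sizes):
    """Split items into consecutive batches whose sizes are drawn from `sizes`."""
    items = list(items)
    index = 0
    while index < len(items):
        size = int(next(sizes))
        yield items[index:index + size]
        index += size

examples = list(range(20))  # placeholder training examples
batches = list(minibatch(examples, compounding(1.0, 8.0, 2.0)))
batch_sizes = [len(batch) for batch in batches]
```

Each batch would then be passed to a single update call, so early updates are small and frequent while later ones are larger and more stable.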
Training the named entity recognizer
All spaCy models support online learning, so you can update a pre-trained model with new examples. You'll usually need to provide many examples to meaningfully improve the system — a few hundred is a good start, although more is better.
You should avoid iterating over the same few examples multiple times, or the model is likely to "forget" how to annotate other examples. If you iterate over the same few examples, you're effectively changing the loss function. The optimizer will find a way to minimize the loss on your examples, without regard for the consequences on the examples it's no longer paying attention to. One way to avoid this "catastrophic forgetting" problem is to "remind" the model of other examples by augmenting your annotations with sentences annotated with entities automatically recognised by the original model. Ultimately, this is an empirical process: you'll need to experiment on your data to find a solution that works best for you.
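The "reminding" strategy can be sketched as follows. This is only an illustration in plain Python: `mix_with_revision` and its `revision_ratio` parameter are hypothetical, and `predict_entities` stands in for whatever the original model predicts on unlabelled text.

```python
import random

def mix_with_revision(new_examples, raw_texts, predict_entities,
                      revision_ratio=1.0, seed=0):
    """Combine new annotations with examples labelled by the original model."""
    n_revision = min(len(raw_texts), int(len(new_examples) * revision_ratio))
    rng = random.Random(seed)
    sampled = rng.sample(list(raw_texts), n_revision)
    revision = [(text, {"entities": predict_entities(text)}) for text in sampled]
    mixed = list(new_examples) + revision
    rng.shuffle(mixed)
    return mixed

def predict_entities(text):
    """Hypothetical stand-in for the original model's entity predictions."""
    start = text.find("Google")
    return [(start, start + 6, "ORG")] if start >= 0 else []

new_examples = [("Horses are tall", {"entities": [(0, 6, "ANIMAL")]})]
raw_texts = ["Google rebrands its business apps", "Nothing to see here"]
mixed = mix_with_revision(new_examples, raw_texts, predict_entities, revision_ratio=2.0)
```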
TIP: CONVERTING ENTITY ANNOTATIONS
You can train the entity recognizer with entity offsets or annotations in the BILUO scheme. The spacy.gold module also exposes two helper functions to convert offsets to BILUO tags, and BILUO tags to entity offsets.
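The conversion from BILUO tags back to character offsets can be sketched in plain Python as below. This is an illustration of the idea rather than the spacy.gold helper itself; it assumes the text is the words joined by single spaces, and `offsets_from_tags` is a name made up for this example.

```python
def offsets_from_tags(words, tags):
    """Convert BILUO tags over `words` into (start_char, end_char, label) offsets."""
    # Character start of each word in " ".join(words).
    starts, position = [], 0
    for word in words:
        starts.append(position)
        position += len(word) + 1
    entities, begin = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            entities.append((starts[i], starts[i] + len(words[i]), label))
        elif prefix == "B":
            begin = i  # remember where the multi-token entity starts
        elif prefix == "L" and begin is not None:
            entities.append((starts[begin], starts[i] + len(words[i]), label))
            begin = None
    return entities

words = ["Facebook", "released", "React", "in", "2014"]
tags = ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]
entities = offsets_from_tags(words, tags)
```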
Updating the Named Entity Recognizer
This example shows how to update spaCy's entity recognizer with your own examples, starting off with an existing, pre-trained model, or from scratch using a blank Language class. To do this, you'll need example texts and the character offsets and labels of each entity contained in the texts.
spacy/examples/training/train_ner.py
Step by step guide
1. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using a blank model, don't forget to add the entity recognizer to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the entity recognizer.
2. Shuffle and loop over the examples. For each example, update the model by calling nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
3. Save the trained model using nlp.to_disk.
4. Test the model to make sure the entities in the training data are recognised correctly.
Training an additional entity type
This script shows how to add a new entity type ANIMAL to an existing pre-trained NER model, or an empty Language class. To keep the example short and simple, only a few sentences are provided as examples. In practice, you'll need many more — a few hundred would be a good start. You will also likely need to mix in examples of other entity types, which might be obtained by running the entity recognizer over unlabelled sentences, and adding their annotations to the training set.
spacy/examples/training/train_new_entity_type.py
IMPORTANT NOTE
If you're using an existing model, make sure to mix in examples of other entity types that spaCy correctly recognized before. Otherwise, your model might learn the new type, but "forget" what it previously knew. This is also referred to as the "catastrophic forgetting" problem.
Step by step guide
1. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using a blank model, don't forget to add the entity recognizer to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the entity recognizer.
2. Add the new entity label to the entity recognizer using the add_label method. You can access the entity recognizer in the pipeline via nlp.get_pipe('ner').
3. Loop over the examples and call nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations, to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
4. Save the trained model using nlp.to_disk.
5. Test the model to make sure the new entity is recognised correctly.
Training the tagger and parser
Updating the Dependency Parser
This example shows how to train spaCy's dependency parser, starting off with an existing model or a blank model. You'll need a set of training examples and the respective heads and dependency label for each token of the example texts.
spacy/examples/training/train_parser.py
Step by step guide
1. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using a blank model, don't forget to add the parser to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the parser.
2. Add the dependency labels to the parser using the add_label method. If you're starting off with a pre-trained spaCy model, this is usually not necessary – but it doesn't hurt either, just to be safe.
3. Shuffle and loop over the examples. For each example, update the model by calling nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
4. Save the trained model using nlp.to_disk.
5. Test the model to make sure the parser works as expected.
Updating the Part-of-speech Tagger
In this example, we're training spaCy's part-of-speech tagger with a custom tag map. We start off with a blank Language class, update its defaults with our custom tags and then train the tagger. You'll need a set of training examples and the respective custom tags, as well as a dictionary mapping those tags to the Universal Dependencies scheme.
spacy/examples/training/train_tagger.py
Step by step guide
1. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using a blank model, don't forget to add the tagger to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the tagger.
2. Add the tag map to the tagger using the add_label method. The first argument is the new tag name, the second the mapping to spaCy's coarse-grained tags, e.g. {'pos': 'NOUN'}.
3. Shuffle and loop over the examples. For each example, update the model by calling nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
4. Save the trained model using nlp.to_disk.
5. Test the model to make sure the tagger works as expected.
Training a parser for custom semantics
spaCy's parser component can be trained to predict any type of tree structure over your input text – including semantic relations that are not syntactic dependencies. This can be useful for conversational applications, which need to predict trees over whole documents or chat logs, with connections between the sentence roots used to annotate discourse structure. For example, you can train spaCy's parser to label intents and their targets, like attributes, quality, time and locations. The result could look like this:
doc = nlp(u"find a hotel with good wifi")
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
# [('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'),
#  ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
The above tree attaches "wifi" to "hotel" and assigns the dependency label ATTRIBUTE. This may not be a correct syntactic dependency – but in this case, it expresses exactly what we need: the user is looking for a hotel with the attribute "wifi" of the quality "good". This query can then be processed by your application and used to trigger the respective action – e.g. search the database for hotels with high ratings for their wifi offerings.
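How an application might process such a parse can be sketched in plain Python. The function `build_query` and the shape of the resulting dictionary are assumptions made for this example; the input is the list of (text, label, head) triples from the parse described above.

```python
def build_query(triples):
    """Turn (text, dep, head) triples into a nested query dictionary."""
    query = {"intent": None, "place": None, "attributes": {}}
    # First pass: remember which quality describes which attribute.
    qualities = {}
    for text, dep, head in triples:
        if dep == "QUALITY":
            qualities[head] = text  # e.g. "good" describes "wifi"
    # Second pass: fill in the intent, the place and its attributes.
    for text, dep, head in triples:
        if dep == "ROOT":
            query["intent"] = text
        elif dep == "PLACE":
            query["place"] = text
        elif dep == "ATTRIBUTE":
            query["attributes"][text] = qualities.get(text)
    return query

triples = [("find", "ROOT", "find"), ("hotel", "PLACE", "find"),
           ("good", "QUALITY", "wifi"), ("wifi", "ATTRIBUTE", "hotel")]
query = build_query(triples)
```

The resulting dictionary could then drive a database search for hotels whose wifi is rated "good".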
TIP: MERGE PHRASES AND ENTITIES
To achieve even better accuracy, try merging multi-word tokens and entities specific to your domain into one token before parsing your text. You can do this by running the entity recognizer or rule-based matcher to find relevant spans, and merging them using Span.merge. You could even add your own custom pipeline component to do this automatically – just make sure to add it before='parser'.
The following example shows a full implementation of a training loop for a custom message parser for a common "chat intent": finding local businesses. Our message semantics will have the following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME and LOCATION.
spacy/examples/training/train_intent_parser.py
Step by step guide
1. Create the training data consisting of words, their heads and their dependency labels in order. A token's head is the index of the token it is attached to. The heads don't need to be syntactically correct – they should express the semantic relations you want the parser to learn. For words that shouldn't receive a label, you can choose an arbitrary placeholder, for example -.
2. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using a blank model, don't forget to add the custom parser to the pipeline. If you're using an existing model, make sure to remove the old parser from the pipeline, and disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the parser.
3. Add the dependency labels to the parser using the add_label method.
4. Shuffle and loop over the examples. For each example, update the model by calling nlp.update, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
5. Save the trained model using nlp.to_disk.
6. Test the model to make sure the parser works as expected.
Training a text classification model
Adding a text classifier to a spaCy model (v2.0)
This example shows how to train a multi-label convolutional neural network text classifier on IMDB movie reviews, using spaCy's new TextCategorizer component. The dataset will be loaded automatically via Thinc's built-in dataset loader. Predictions are available via Doc.cats .
spacy/examples/training/train_textcat.py
Step by step guide
1. Load the model you want to start with, or create an empty model using spacy.blank with the ID of your language. If you're using an existing model, make sure to disable all other pipeline components during training using nlp.disable_pipes. This way, you'll only be training the text classifier.
2. Add the text classifier to the pipeline, and add the labels you want to train, for example POSITIVE.
3. Load and pre-process the dataset, shuffle the data and split off a part of it to hold back for evaluation. This way, you'll be able to see results on each training iteration.
4. Loop over the training examples and partition them into batches using spaCy's minibatch and compounding helpers.
5. Update the model by calling nlp.update, which steps through the examples and makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct prediction will score higher next time.
6. Optionally, you can also evaluate the text classifier on each iteration, by checking how it performs on the development data held back from the dataset. This lets you print the precision, recall and F-score.
7. Save the trained model using nlp.to_disk.
8. Test the model to make sure the text classifier works as expected.
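Steps 3 and 6 can be sketched without spaCy at all. The block below is a minimal illustration, not the code from train_textcat.py: split_data and evaluate are hypothetical helper names, the 0.5 threshold is an assumption, and the score dicts only mimic the {label: score} shape of Doc.cats.

```python
import random


def split_data(examples, split=0.8, seed=0):
    """Shuffle the examples and hold back the last (1 - split) share for evaluation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * split)
    return examples[:cut], examples[cut:]


def evaluate(predicted, gold, threshold=0.5):
    """Compute precision, recall and F-score over all (document, label) pairs.

    predicted and gold are parallel lists of {label: score} dicts,
    in the same shape as Doc.cats (assumed here for illustration).
    """
    tp = fp = fn = 0
    for pred_cats, gold_cats in zip(predicted, gold):
        for label, score in pred_cats.items():
            if label not in gold_cats:
                continue
            is_pred = score >= threshold
            is_gold = gold_cats[label] >= threshold
            if is_pred and is_gold:
                tp += 1
            elif is_pred and not is_gold:
                fp += 1
            elif not is_pred and is_gold:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy predictions and annotations, invented purely to exercise the function
pred = [{"POSITIVE": 0.9}, {"POSITIVE": 0.8}, {"POSITIVE": 0.1}, {"POSITIVE": 0.2}]
gold = [{"POSITIVE": 1.0}, {"POSITIVE": 0.0}, {"POSITIVE": 1.0}, {"POSITIVE": 0.0}]
scores = evaluate(pred, gold)
```

In a real training loop you would call something like evaluate once per iteration on the held-back development examples, so that the scores show whether the model is still improving.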
Optimization tips and advice
There are lots of conflicting "recipes" for training deep neural networks at the moment. The cutting-edge models take a very long time to train, so most researchers can't run enough experiments to figure out what's really going on. For what it's worth, here's a recipe that seems to work well on a lot of NLP problems:
1. Initialise with batch size 1, and compound to a maximum determined by your data size and problem type.
2. Use the Adam solver with a fixed learning rate.
3. Use averaged parameters.
4. Use L2 regularization.
5. Clip gradients by L2 norm to 1.
6. On small data sizes, start at a high dropout rate, with linear decay.
This recipe has been cobbled together experimentally. Here's why the various elements of the recipe made enough sense to try initially, and what you might try changing, depending on your problem.
Compounding batch size
The trick of increasing the batch size is starting to become quite popular (see Smith et al., 2017). Their recipe is quite different from how spaCy's models are being trained, but there are some similarities. In training the various spaCy models, we haven't found much advantage from decaying the learning rate – but starting with a low batch size has definitely helped. You should try it out on your data, and see how you go. Here's our current strategy:
BATCH HEURISTIC
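from spacy.util import minibatch, compounding

def get_batches(train_data, model_type):
    # Upper bound on the batch size for each component type
    max_batch_sizes = {'tagger': 32, 'parser': 16, 'ner': 16, 'textcat': 64}
    max_batch_size = max_batch_sizes[model_type]
    # Halve the maximum for small datasets, and again for very small ones
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    # Start at batch size 1 and grow by 0.1% per batch up to the maximum
    batch_size = compounding(1, max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches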
This will set the batch size to start at 1, and increase it with each batch until it reaches a maximum size. The tagger, parser and entity recognizer all take whole sentences as input, so they're learning a lot of labels in a single example. You therefore need smaller batches for them. The batch size for the text categorizer should be somewhat larger, especially if your documents are long.
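To see what a compounding schedule does to the batch sizes, here is a minimal pure-Python sketch. The two generators are simplified stand-ins written for this illustration, not spaCy's own minibatch and compounding helpers, and the growth factor of 1.5 is deliberately exaggerated (the heuristic above uses 1.001) so the effect shows up on 100 items.

```python
import itertools


def compounding(start, stop, compound):
    """Yield start, start * compound, start * compound ** 2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def minibatch(items, size):
    """Group items into batches whose sizes are drawn from the size iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


# Exaggerated growth factor so the schedule reaches its cap quickly
sizes = compounding(1, 16, 1.5)
batches = list(minibatch(range(100), size=sizes))
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)
```

The early batches contain a single example each, which gives the model many small, noisy updates at the start of training. Later batches settle at the cap.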
Learning rate, regularization and gradient clipping
By default spaCy uses the Adam solver, with default settings (learning rate 0.001, beta1=0.9, beta2=0.999). Some researchers have said they found these settings terrible on their problems – but they've always performed very well in training spaCy's models, in combination with the rest of our recipe. You can change these settings directly, by modifying the corresponding attributes on the optimizer object. You can also set environment variables to adjust the defaults.
There are two other key hyper-parameters of the solver: L2 regularization, and gradient clipping (max_grad_norm). Gradient clipping is a hack that's not discussed often, but that everybody seems to be using. It's quite important in helping to ensure the network doesn't diverge, which is a fancy way of saying "fall over during training". The effect is sort of similar to setting the learning rate low. It can also compensate for a large batch size (this is a good example of how the choices of all these hyper-parameters intersect).
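As a rough illustration of the two hyper-parameters, the sketch below applies an L2 penalty and then clips the gradient by its L2 norm. It works on plain Python lists and is not how Thinc implements either step. The function names and the l2 value are assumptions made for this example, and the penalty is assumed to be 0.5 * l2 * sum(w ** 2), whose gradient is l2 * w.

```python
import math


def l2_penalty_gradient(weights, gradient, l2=1e-6):
    """Add the gradient of the penalty 0.5 * l2 * sum(w ** 2) to each entry."""
    return [g + l2 * w for g, w in zip(gradient, weights)]


def clip_gradient(gradient, max_norm=1.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


# A gradient with norm 5 is scaled down to norm 1, keeping its direction
clipped = clip_gradient([3.0, 4.0], max_norm=1.0)
```

Clipping leaves small gradients untouched and only shortens large ones, which is why its effect resembles a lower learning rate on the occasional very large update.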
Dropout rate
For small datasets, it's useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:
from spacy.util import decaying
dropout = decaying(0.6, 0.2, 1e-4)
You can then draw values from the iterator with next(dropout), which you would pass to the drop keyword argument of nlp.update. It's pretty much always a good idea to use at least some dropout. All of the models currently use Bernoulli dropout, for no particularly principled reason – we just haven't experimented with another scheme like Gaussian dropout yet.
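The schedule described above can be pictured with a small stand-in generator. This is a sketch of a linear decay with a floor, written for illustration. linear_decay is a hypothetical name, and spaCy's own decaying helper may use a different decay formula depending on the version.

```python
def linear_decay(start, stop, decay):
    """Yield start, start - decay, start - 2 * decay, ... never going below stop."""
    curr = float(start)
    while True:
        yield max(curr, stop)
        curr -= decay


dropout = linear_decay(0.6, 0.2, 1e-4)
first = next(dropout)  # the rate used for the first update

# Skip ahead: after enough updates the rate settles at the floor value
for _ in range(5000):
    rate = next(dropout)
```

In a training loop, each value drawn with next(dropout) would be passed as the drop argument of one nlp.update call.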
Parameter averaging
The last part of our optimization recipe is parameter averaging, an old trick introduced by Freund and Schapire (1999), popularised in the NLP community by Collins (2002), and explained in more detail by Leon Bottou. Just about the only other people who seem to be using this for neural network training are the SyntaxNet team (one of whom is Michael Collins) – but it really seems to work great on every problem.
The trick is to store the moving average of the weights during training. We don't optimize this average – we just track it. Then when we want to actually use the model, we use the averages, not the most recent value. In spaCy (and Thinc) this is done by using a context manager, use_params, to temporarily replace the weights:
with nlp.use_params(optimizer.averages):
    nlp.to_disk('/model')
The context manager is handy because you naturally want to evaluate and save the model at various points during training (e.g. after each epoch). After evaluating and saving, the context manager will exit and the weights will be restored, so you resume training from the most recent value, rather than the average. By evaluating the model after each epoch, you can remove one hyper-parameter from consideration (the number of epochs). Having one less magic number to guess is extremely nice – so having the averaging under a context manager is very convenient.
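The mechanics of tracking an average and swapping it in temporarily can be shown on a toy model. The class below is a sketch under simplifying assumptions: it uses plain SGD and a cumulative mean of the weights, whereas Thinc's optimizer may weight the updates differently. AveragedModel and its attributes are hypothetical names.

```python
from contextlib import contextmanager


class AveragedModel:
    """Toy model: a list of weights plus a running average of those weights."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.averages = list(weights)
        self.n_updates = 0

    def update(self, gradient, learn_rate=0.1):
        # Plain SGD step on the live weights
        self.weights = [w - learn_rate * g for w, g in zip(self.weights, gradient)]
        # Track the mean of the weights after each update; it is never optimized
        self.n_updates += 1
        self.averages = [
            a + (w - a) / self.n_updates
            for a, w in zip(self.averages, self.weights)
        ]

    @contextmanager
    def use_params(self, params):
        # Temporarily swap in params, then restore the live weights on exit
        backup = self.weights
        self.weights = list(params)
        try:
            yield
        finally:
            self.weights = backup


model = AveragedModel([1.0])
model.update([1.0])  # live weight moves to about 0.9
model.update([1.0])  # live weight moves to about 0.8
with model.use_params(model.averages):
    inside = list(model.weights)  # averaged weight, about 0.85
after = list(model.weights)  # live weight again, about 0.8
```

Because the swap is undone in a finally clause, training resumes from the most recent weights even if evaluation or saving raises an error inside the with block.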
Transfer learning
Finally, if you're training from a small data set, it's very useful to start off with some knowledge already in the model. Word vectors are an easy and reliable way to do that, but depending on the application, you may also be able to start with useful knowledge from one of spaCy's pre-trained models, such as the parser, entity recogniser and tagger. If you're adapting a pre-trained model and you want it to retain accuracy on the tasks it was originally trained for, you should consider the "catastrophic forgetting" problem. See this blog post (https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting) to read more about the problem and our suggested solution, pseudo-rehearsal.
Saving and loading models
After training your model, you'll usually want to save its state, and load it back later. You can do this with the Language.to_disk() method:
nlp.to_disk('/home/me/data/en_example_model')
The directory will be created if it doesn't exist, and the whole pipeline will be written out. To make the model more convenient to deploy, we recommend wrapping it as a Python package.
Generating a model package
IMPORTANT NOTE
The model packages are not suitable for the public pypi.python.org directory, which is not designed for binary data and files over 50 MB. However, if your company is running an internal installation of PyPI, publishing your models on there can be a convenient way to share them with your team.
spaCy comes with a handy CLI command that will create all required files, and walk you through generating the meta data. You can also create the meta.json manually and place it in the model data directory, or supply a path to it using the --meta flag. For more info on this, see the package docs (https://spacy.io/api/cli#package).
META.JSON
{
    "name": "example_model",
    "lang": "en",
    "version": "1.0.0",
    "spacy_version": ">=2.0.0,<3.0.0",
    "description": "Example model for spaCy",
    "author": "You",
    "email": "you@example.com",
    "license": "CC BY-SA 3.0",
    "pipeline": ["tagger", "parser", "ner"]
}
python -m spacy package /home/me/data/en_example_model /home/me/my_models
This command will create a model package directory that should look like this:
DIRECTORY STRUCTURE
└── /
    ├── MANIFEST.in                   # to include meta.json
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_example_model              # model directory
        ├── __init__.py               # init for pip installation
        └── en_example_model-1.0.0    # model data
You can also find templates for all files on GitHub (https://github.com/explosion/spacy-models/blob/master/template). If you're creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.
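If you assemble a package by hand, the two directory names can be derived from the meta.json fields rather than typed out. A minimal sketch, where package_dir_names is a hypothetical helper and only the lang, name and version keys are assumed to be present:

```python
import json


def package_dir_names(meta):
    """Return (package_name, data_dir_name) following lang_name and lang_name-version."""
    package_name = "{lang}_{name}".format(**meta)
    data_dir_name = "{}-{}".format(package_name, meta["version"])
    return package_name, data_dir_name


# The relevant fields of the example meta.json
meta = json.loads("""
{
    "name": "example_model",
    "lang": "en",
    "version": "1.0.0"
}
""")
names = package_dir_names(meta)
```

For the example meta.json this yields the same two names that appear in the directory structure: en_example_model for the package and en_example_model-1.0.0 for the data directory.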
Customising the model setup
The meta.json includes the model details, like name, requirements and license, and lets you customise how the model should be initialised and loaded. You can define the language data to be loaded and the processing pipeline to execute.
The load() method that comes with our model package templates will take care of putting all this together and returning a Language object with the loaded pipeline and data. If your model requires custom pipeline components or a custom language class, you can also ship the code with your model. For examples of this, check out the implementations of spaCy's load_model_from_init_py and load_model_from_path utility functions.
Building the model package
To build the package, run the following command from within the directory. For more information on building Python packages, see the docs on Python's Setuptools (https://setuptools.readthedocs.io/en/latest/).
python setup.py sdist
This will create a .tar.gz archive in a directory /dist. The model can be installed by pointing pip to the path of the archive:
pip install /path/to/en_example_model-1.0.0.tar.gz
You can then load the model via its name, en_example_model, or import it directly as a module and then call its load() method.
Loading a custom model package
To load a model from a data directory, you can use spacy.load() with the local path. This will look for a meta.json in the directory and use the lang and pipeline settings to initialise a Language class with a processing pipeline and load in the model data.
nlp = spacy.load('/path/to/model')
If you want to load only the binary data, you'll have to create a Language class and call from_disk instead.
nlp = spacy.blank('en').from_disk('/path/to/data')
IMPORTANT NOTE: LOADING DATA IN V2.X
In spaCy 1.x, the distinction between spacy.load() and the Language class constructor was quite unclear. You could call spacy.load() when no model was present, and it would silently return an empty object. Likewise, you could pass a path to English, even if the model required a different language. spaCy v2.0 solves this with a clear distinction between setting up the instance and loading the data.
Correct: nlp = spacy.blank('en').from_disk('/path/to/data')
Incorrect: nlp = spacy.load('en', path='/path/to/data')
Example: How we're training and packaging models for spaCy
Publishing a new version of spaCy often means re-training all available models – currently, that's 13 models for 8 languages. To make this run smoothly, we're using an automated build process and a spacy train template that looks like this:
python -m spacy train {lang} {models_dir}/{name} {train_data} {dev_data} -m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}
META.JSON TEMPLATE
{
    "lang": "en",
    "name": "core_web_sm",
    "license": "CC BY-SA 3.0",
    "author": "Explosion AI",
    "url": "https://explosion.ai",
    "email": "contact@explosion.ai",
    "sources": ["OntoNotes 5", "Common Crawl"],
    "description": "English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities."
}
In a directory meta, we keep meta.json templates for the individual models, containing all relevant information that doesn't change across versions, like the name, description, author info and training data sources. When we train the model, we pass in the path to the meta template as the --meta argument, and specify the current model version as the --version argument.
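One way to drive such a template from a build script is to fill in the placeholders with str.format and split the result into an argument list. The sketch below only builds the command and does not run it. build_train_command and all the settings passed to it (paths, version, epoch count) are hypothetical values chosen for illustration.

```python
import shlex

# The spacy train template, with one placeholder per build setting
TRAIN_TEMPLATE = (
    "python -m spacy train {lang} {models_dir}/{name} {train_data} {dev_data} "
    "-m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}"
)


def build_train_command(**settings):
    """Fill in the template and split it into an argument list for subprocess."""
    return shlex.split(TRAIN_TEMPLATE.format(**settings))


# Hypothetical settings for a single model
args = build_train_command(
    lang="en",
    models_dir="models",
    name="core_web_sm",
    train_data="train.json",
    dev_data="dev.json",
    version="2.0.0",
    gpu_id=-1,
    n_epoch=30,
    n_sents=0,
)
```

The resulting list could be handed to subprocess.run, once per model and language, which keeps the per-model differences confined to the settings and the meta template.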
On each epoch, the model is saved out with a meta.json using our template and added properties, like the pipeline, accuracy scores and the spacy_version used to train the model. After training completion, the best model is selected automatically and packaged using the package command. Since a full meta file is already present on the trained model, no further setup is required to build a valid model package.
python -m spacy package -f {best_model} dist/
cd dist/{model_name}
python setup.py sdist
This process allows us to quickly trigger the model training and build process for all available models and languages, and generate the correct meta data automatically.
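The automatic selection of the best model is not spelled out here, but it could look roughly like the sketch below: read the meta.json saved with each epoch and keep the directory with the highest score. The directory names (model0, model1, ...), the "accuracy" key layout, the ents_f metric and the scores themselves are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def select_best_model(models_dir, metric="ents_f"):
    """Return the sub-directory whose meta.json reports the highest metric."""
    best_path, best_score = None, float("-inf")
    for meta_path in sorted(Path(models_dir).glob("*/meta.json")):
        meta = json.loads(meta_path.read_text(encoding="utf8"))
        score = meta.get("accuracy", {}).get(metric)
        if score is not None and score > best_score:
            best_path, best_score = meta_path.parent, score
    return best_path


# Build three fake per-epoch model directories with made-up scores
with tempfile.TemporaryDirectory() as tmp:
    for epoch, score in enumerate([81.2, 84.7, 83.9]):
        epoch_dir = Path(tmp) / "model{}".format(epoch)
        epoch_dir.mkdir()
        meta = {"lang": "en", "name": "core_web_sm", "accuracy": {"ents_f": score}}
        (epoch_dir / "meta.json").write_text(json.dumps(meta), encoding="utf8")
    best_name = select_best_model(tmp).name
```

The returned path is what would be substituted for {best_model} in the package command.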