Paper

Reading notes, written as I read (my memory isn't great).
BERT: Bidirectional Encoder Representations from Transformers

Introduction

Adding one output layer on top of pre-trained BERT is enough to build state-of-the-art models for many tasks: "the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks."
No task-specific architectures need to be built for different tasks, e.g. question answering and language inference.
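The "one additional output layer" idea can be sketched as a linear classifier over the final hidden state of the [CLS] token. Everything below is a hypothetical toy illustration (dimensions, weights, and class names are made up, not real BERT parameters):

```python
# Hypothetical sketch of fine-tuning's "one additional output layer":
# a linear classifier over the final [CLS] hidden state.
# Toy dimensions and weights, not real BERT parameters.

def classify_cls(cls_hidden, weights, bias):
    """logits[k] = sum_i weights[k][i] * cls_hidden[i] + bias[k]"""
    return [sum(w * h for w, h in zip(row, cls_hidden)) + b
            for row, b in zip(weights, bias)]

# Toy setup: hidden size 4, two classes (e.g. entailment vs. not).
cls_hidden = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, 0.3, 0.4],    # class 0
           [-0.4, 0.1, 0.0, 0.2]]   # class 1
bias = [0.0, 0.1]

logits = classify_cls(cls_hidden, weights, bias)
prediction = max(range(len(logits)), key=lambda k: logits[k])
```

During fine-tuning, only this layer is new; all pre-trained parameters are updated jointly with it.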

Contributions:

  1. BERT uses a deeply bidirectional model, surpassing earlier unidirectional and shallowly bidirectional approaches.
  2. It is the first representation model to achieve state-of-the-art results through fine-tuning alone.
  3. BERT advances the state of the art for 11 NLP tasks.
    In short: a simple model, but very effective.

Before BERT there were two strategies:
feature-based: the pre-trained representations feed into a task-specific architecture, e.g. ELMo
fine-tuning: all pre-trained parameters are fine-tuned, e.g. the Generative Pre-trained Transformer (OpenAI GPT)
Both use unidirectional language models to learn general language representations.

BERT pre-training: a masked language model (MLM) that randomly masks tokens in the input and predicts them from both the left and right context.
next sentence prediction

Results on existing tasks:
GLUE: 80.4%
MultiNLI: 86.7%
SQuAD v1.1: F1 of 93.2
sentence-level tasks: natural language inference, paraphrasing
token-level tasks: named entity recognition, SQuAD question answering

Related work:

  1. Feature-based Approaches:
    1.1 Pre-trained word embeddings
    1.2 sentence embeddings
    1.3 paragraph embeddings
    1.4 ELMo: extracts context-sensitive features from a language model and integrates the contextual embeddings into task-specific architectures.
  2. Fine-tuning Approaches: transfer learning
    Based on the idea that few parameters need to be learned from scratch; OpenAI GPT takes this approach.
  3. Transfer Learning from Supervised Data

Architecture:

multi-layer bidirectional Transformer encoder
Identical to the original Transformer; see Vaswani et al. (2017) as well as excellent guides such as "The Annotated Transformer."

BERT-BASE: L=12, H=768, A=12, Total Parameters=110M
BERT-LARGE: L=24, H=1024, A=16, Total Parameters=340M
BERT-BASE was chosen to have the same model size as OpenAI GPT for comparison purposes.

Input representation


The first token of every sequence is always [CLS]; for classification tasks its final hidden state serves as the aggregate sequence representation (for non-classification tasks it is ignored). Sentence pairs are separated by [SEP].

  • We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.
  • We use learned positional embeddings with supported sequence lengths up to 512 tokens.
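The WordPiece scheme above can be illustrated with a greedy longest-match-first tokenizer. The tiny vocabulary below is hypothetical (the real model uses 30,000 tokens), and this is a sketch of the idea, not Google's implementation:

```python
# Greedy longest-match-first WordPiece tokenization sketch.
# Continuation pieces are marked with "##", as in BERT.
# The vocabulary here is a toy example, not the real 30k vocab.

def wordpiece_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # shrink until a match is found
        if piece is None:
            return ["[UNK]"]                   # no piece matched at all
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##afford", "##able"}
tokens = wordpiece_tokenize("playing", vocab)   # ["play", "##ing"]
```

The greedy rule always takes the longest vocabulary piece available at the current position, which is why "playing" splits as play + ##ing rather than into characters.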

Architecture comparison


GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

Training details

pre-train BERT using two novel unsupervised prediction tasks:
Masked LM: mask 15% of the WordPiece tokens in each sequence.
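The corruption step can be sketched as follows. The 80/10/10 split (replace with [MASK] / random token / keep unchanged) follows the paper; the sentence, vocabulary, and function name are toy assumptions:

```python
import random

# Sketch of BERT's masked-LM corruption: 15% of tokens are selected;
# of those, 80% become [MASK], 10% become a random token, and 10%
# stay unchanged. Toy sentence and vocabulary; seeded for determinism.

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                      # model must recover this
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # else 10%: keep the original token
    return corrupted, targets

rng = random.Random(0)
sentence = ["my", "dog", "is", "hairy", "and", "friendly"]
vocab = ["my", "dog", "is", "hairy", "and", "friendly", "apple", "the"]
corrupted, targets = mask_tokens(sentence, vocab, rng)
```

Keeping 10% of selected tokens unchanged means the model cannot assume an unmasked token is correct, which mitigates the pre-train/fine-tune mismatch ([MASK] never appears at fine-tuning time).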
Next Sentence Prediction:
50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus.
They are sampled such that the combined length is ≤ 512 tokens.
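Pair construction for next sentence prediction can be sketched like this (toy corpus, seeded RNG; a simplification of the actual data generation, which also enforces the ≤ 512-token budget):

```python
import random

# Sketch of next-sentence-prediction pair construction: 50% of the
# time B actually follows A in the corpus (IsNext), 50% of the time
# B is a random corpus sentence (NotNext). Corpus is a toy example.

def make_nsp_pair(corpus, idx, rng):
    a = corpus[idx]
    if rng.random() < 0.5:
        return a, corpus[idx + 1], "IsNext"      # the true next sentence
    return a, rng.choice(corpus), "NotNext"      # a random sentence

corpus = [
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless birds .",
]
rng = random.Random(42)
a, b, label = make_nsp_pair(corpus, 0, rng)
```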



training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
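A toy numeric sketch of that combined objective, with made-up model probabilities standing in for the network's outputs (the loss is the sum of the two mean negative log-likelihoods):

```python
import math

# Sketch of the pre-training objective: total loss = mean masked-LM
# negative log-likelihood + mean next-sentence negative log-likelihood.
# The probabilities below are toy values, not model outputs.

def mean_nll(probs_of_correct):
    """Mean negative log-likelihood of the correct labels."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)

mlm_probs = [0.9, 0.6, 0.75]   # model prob. of each masked token's true id
nsp_probs = [0.8, 0.95]        # model prob. of the true IsNext/NotNext label

total_loss = mean_nll(mlm_probs) + mean_nll(nsp_probs)
```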

Experiments:

GLUE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.

Dataset descriptions:

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

MNLI:

Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task.
Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

QQP:

Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent.
Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semanti- cally equivalent (Dolan and Brockett, 2005).

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).
WNLI Winograd NLI is a small natural language inference dataset deriving from (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set out of fairness to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

SQuAD v1.1: a span prediction task
100k crowdsourced question/answer pairs



At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used.
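That inference rule can be sketched as scoring every (start, end) pair by start_logit + end_logit and keeping only spans with end ≥ start. The logits below are toy values, chosen so that the unconstrained argmaxes would pick an end before the start:

```python
# Sketch of SQuAD span selection at inference: choose the (start, end)
# pair maximizing start_logits[i] + end_logits[j] subject to j >= i,
# the only constraint applied. Toy logits, not model outputs.

def best_span(start_logits, end_logits):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, len(end_logits)):   # j >= i enforces end >= start
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits   = [3.0, 0.2, 1.8, 0.1]
span = best_span(start_logits, end_logits)    # (1, 2)
```

Note that taking the argmax of each list independently would give start=1, end=0, an invalid span; the joint search with the end ≥ start constraint avoids this.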

Named Entity Recognition: CoNLL 2003
Feed the final hidden representation Ti ∈ R^H of each token i into a classification layer over the NER label set.
A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks.


Analysis

Increasing model size improves results.

Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy?
Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.


Effect of the number of pre-training steps.
When BERT is used as a feature extractor, concatenating the last four hidden layers works best.

Results are hard to reproduce exactly: machines differ in memory, so the batch size that fits in one pass differs.
