Language Model for ASR

Background

Automatic Speech Recognition (ASR) combines an acoustic model (AM) and a language model (LM) to output the transcript y of an input audio signal x:
y^* = \arg \max_y [\log P_{AM}(y|x) + \lambda \log P_{LM}(y)]

Acoustic Model
Assigns a probability distribution over a vocabulary of characters given an audio frame.

Language Model

  • Task 1: Assigns a probability distribution over the next word, given the preceding words.
  • Task 2: Assigns a probability to a whole sequence of words.
    Given a corpus of tokens x = (x_1, . . . , x_T), the task of language modeling is to estimate the joint probability P(x) (Task 2). By the chain rule, P(x) can be factorized auto-regressively as P(x) = \prod_t P(x_t | x_{<t}), so the problem reduces to estimating each conditional factor (Task 1).
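As a toy illustration of this factorization, the sketch below scores a whole sequence by accumulating per-token log-probabilities; the hard-coded conditional table is made up, and in practice it would come from an n-gram model or a neural network.

```python
import math

# Made-up conditional probabilities P(x_t | x_{<t}) for a toy vocabulary.
COND_PROB = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"cat": 0.6, "dog": 0.4},
    ("<s>", "the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x) = sum_t log P(x_t | x_{<t})."""
    history = ("<s>",)
    log_p = 0.0
    for tok in tokens:
        log_p += math.log(COND_PROB[history][tok])
        history = history + (tok,)
    return log_p

print(sequence_log_prob(["the", "cat", "sat"]))  # log(0.5 * 0.6 * 0.7)
```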

Structure

A proposed structure[1] is shown below. An n-gram model (such as KenLM[2]) is used during decoding to produce the ASR output, and a neural network language model (NN LM) complements it. Beam search during hypothesis formation is guided by the joint score of the ASR model and KenLM, and the resulting beams are then re-scored with the NN LM in a single forward pass. Using the NN LM instead of KenLM throughout would be too expensive, because its inference is much slower.

ASR correction model based on NN encoder-decoder architecture
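A minimal sketch of this two-pass scoring is shown below. It is not the exact pipeline of [1]; nnlm_score is a hypothetical stand-in for a neural LM forward pass, and each beam is assumed to carry the AM and KenLM log-probabilities accumulated during beam search.

```python
def nnlm_score(tokens):
    # Placeholder for a neural LM forward pass returning log P(tokens).
    return -2.0 * len(tokens)

def rescore_beams(beams, lam=0.8, mu=0.5):
    """beams: list of (tokens, am_logprob, kenlm_logprob) produced by beam search."""
    best_tokens, best_score = None, float("-inf")
    for tokens, am_lp, kenlm_lp in beams:
        joint = am_lp + lam * kenlm_lp            # score that guided the beam search
        final = joint + mu * nnlm_score(tokens)   # one NN LM pass per surviving beam
        if final > best_score:
            best_tokens, best_score = tokens, final
    return best_tokens

beams = [(["hello", "world"], -5.0, -3.0), (["hello", "word"], -4.8, -6.0)]
print(rescore_beams(beams))
```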

Language Model

n-gram

The n-gram model is the most widely used type of language model.

Definition: an n-gram model is a probability distribution based on the (n-1)th order Markov assumption
P(w_i | w_1 . . . w_{i-1}) = P(w_i | w_{i-n+1} . . . w_{i-1})

Maximum Likelihood Estimation
Let c(w_1 . . . w_k) be the count of the sequence w_1 . . . w_k in the corpus. The ML estimate, for c(w_1 . . . w_{n-1}) \neq 0, is:
P(w_n | w_1 . . . w_{n-1}) = \frac{c(w_1 . . . w_n)}{c(w_1 . . . w_{n-1})}
The problem: c(w_1 . . . w_n) = 0 forces P(w_n | w_1 . . . w_{n-1}) = 0.
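A small sketch of this estimate on a toy corpus (the corpus and the bigram setting are made up for illustration); note how an n-gram that never occurs in the corpus receives probability zero.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
n = 2  # bigram model

ngram_counts = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
context_counts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))

def ml_estimate(context, word):
    """P(word | context) = c(context, word) / c(context)."""
    c_context = context_counts[tuple(context)]
    if c_context == 0:
        return 0.0
    return ngram_counts[tuple(context) + (word,)] / c_context

print(ml_estimate(["the"], "cat"))  # 2/3: "the cat" occurs twice, "the" three times
print(ml_estimate(["the"], "dog"))  # 0.0: unseen bigram gets probability zero
```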

N-Gram Model Problems

  • Sparsity: sequences not found in the sample are assigned probability zero, which leads to speech recognition errors.
    Solution:
    • Smoothing: adjusting ML estimates to reserve probability mass for unseen events. A typical form is the interpolation of n-gram models, e.g., of the trigram, bigram, and unigram relative frequencies (see the sketch after this list):
      P(w_3|w_1 w_2) = \alpha_1 \hat{P}(w_3|w_1 w_2) + \alpha_2 \hat{P}(w_3|w_2) + \alpha_3 \hat{P}(w_3), with \alpha_1 + \alpha_2 + \alpha_3 = 1.
      Some widely used techniques:
      • Katz Back-off models (Katz, 1987).
      • Interpolated models (Jelinek and Mercer, 1980).
      • Kneser-Ney models (Kneser and Ney, 1995).
    • Class-based models: create models based on classes (e.g., DAY) or phrases.
  • Representation: for |\Sigma|=10^{5}, the number of bigrams is 10^{10}, the number of trigrams 10^{15}!
    Solution:
    • Weighted automata: exploiting sparsity.
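Below is a minimal sketch of such an interpolation (a Jelinek-Mercer-style mixture of ML estimates, as referenced in the smoothing bullet above). The interpolation weights are fixed by hand here; in practice they are tuned on held-out data.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

def counts(k):
    return Counter(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))

c1, c2, c3 = counts(1), counts(2), counts(3)
total = sum(c1.values())

def p_interp(w1, w2, w3, alphas=(0.6, 0.3, 0.1)):
    """P(w3 | w1 w2) = a1*P_ML(w3|w1 w2) + a2*P_ML(w3|w2) + a3*P_ML(w3)."""
    a1, a2, a3 = alphas
    p_tri = c3[(w1, w2, w3)] / c2[(w1, w2)] if c2[(w1, w2)] else 0.0
    p_bi = c2[(w2, w3)] / c1[(w2,)] if c1[(w2,)] else 0.0
    p_uni = c1[(w3,)] / total
    return a1 * p_tri + a2 * p_bi + a3 * p_uni

print(p_interp("on", "the", "mat"))  # seen trigram: all three terms contribute
print(p_interp("on", "the", "ran"))  # unseen trigram, yet probability stays above zero
```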

In DeepSpeech, an n-gram model called KenLM[2] is used. KenLM stores its n-grams in a hash table with linear probing. Linear probing places at most one entry in each bucket; when a collision occurs, the entry to be inserted is placed in the next (higher index) empty bucket, wrapping around as necessary.
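A toy sketch of linear probing to illustrate the collision-handling strategy (this is not KenLM's actual implementation, which hashes 64-bit n-gram identifiers and stores packed probability and backoff values):

```python
class LinearProbingTable:
    """Open-addressing hash table: one entry per bucket, probe upward on collision."""

    def __init__(self, n_buckets=8):
        self.keys = [None] * n_buckets
        self.values = [None] * n_buckets

    def _probe(self, key):
        i = hash(key) % len(self.keys)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)   # next bucket, wrapping around
        return i

    def insert(self, key, value):
        i = self._probe(key)
        self.keys[i], self.values[i] = key, value

    def lookup(self, key):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None

table = LinearProbingTable()
table.insert(("the", "cat"), -1.2)       # e.g., a bigram and its log-probability
print(table.lookup(("the", "cat")))      # -1.2
print(table.lookup(("the", "dog")))      # None
```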

Neural Network

A trainable neural network is used to encode the context x_{<t} into a fixed-size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the Softmax function, yielding a categorical probability distribution over the next token. The central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed-size representation.
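A minimal PyTorch sketch of this pipeline; the GRU here is only a stand-in for whatever network (e.g., a Transformer) encodes the context.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embedding = torch.nn.Embedding(vocab_size, d_model)
encoder = torch.nn.GRU(d_model, d_model, batch_first=True)  # stand-in context encoder

context = torch.randint(0, vocab_size, (1, 10))       # token ids of x_{<t}
hidden, _ = encoder(embedding(context))               # (1, 10, d_model)
logits = hidden[:, -1] @ embedding.weight.T           # hidden state x word embeddings
probs = F.softmax(logits, dim=-1)                     # distribution over the next token
print(probs.shape)                                    # torch.Size([1, 1000])
```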

Vanilla Model

One feasible but crude approximation is to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments[3].

Masked Attention: To ensure that the model’s predictions are only conditioned on past characters, we mask the attention layers with a causal attention mask, so each position can only attend leftward.

Character transformer network of two layers processing a four-character sequence to predict t_4. The causal attention mask limits information to left-to-right flow. Red arrows highlight the prediction task the network has to learn.
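A short sketch of such a causal mask applied to a toy score matrix (shapes and values are made up; entry (i, j) is the score of position i attending to position j):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)              # raw attention scores

# Positions j > i are future tokens; set their scores to -inf before the softmax.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)                # row i only attends to positions <= i
print(attn)                                         # upper triangle is exactly zero
```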

Auxiliary Losses: Training with additional auxiliary losses not only speeds up convergence but also serves as an additional regularizer. The auxiliary losses are only active during training, so a number of the network parameters are used only at training time ("training parameters", as opposed to "inference parameters"). Three types of auxiliary losses are used, corresponding to intermediate positions, intermediate layers, and non-adjacent targets.
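As a hedged sketch of the intermediate-layer variant only (not the exact losses of [3]): every layer gets a training-only prediction head, and its loss is added to the final objective; the heads are the "training parameters" that can be discarded at inference time.

```python
import torch
import torch.nn.functional as F

vocab, d_model, n_layers, seq_len = 100, 32, 3, 8
layers = torch.nn.ModuleList([torch.nn.Linear(d_model, d_model) for _ in range(n_layers)])
aux_heads = torch.nn.ModuleList([torch.nn.Linear(d_model, vocab) for _ in range(n_layers)])  # training-only

x = torch.randn(seq_len, d_model)                    # stand-in for embedded characters
targets = torch.randint(0, vocab, (seq_len,))        # next-character targets
loss = 0.0
for layer, head in zip(layers, aux_heads):
    x = torch.relu(layer(x))                         # stand-in for one transformer layer
    loss = loss + F.cross_entropy(head(x), targets)  # per-layer auxiliary loss
loss.backward()
```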

Positional Embeddings: The timing information may get lost during propagation through the layers. To address this, the fixed timing signal is replaced with a learned per-layer positional embedding added to the input sequence before each transformer layer, giving a total of L×N×512 additional parameters (512-dimensional embedding vectors, L context positions, N layers).
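A sketch of the parameter bookkeeping (the dimensions below are examples): one learned (L, 512) table per layer, added to that layer's input.

```python
import torch

L, N, d = 16, 4, 512                        # context positions, layers, embedding size
pos_emb = torch.nn.ParameterList([torch.nn.Parameter(torch.zeros(L, d)) for _ in range(N)])
print(sum(p.numel() for p in pos_emb))      # L * N * 512 = 32768 extra parameters

x = torch.randn(L, d)                       # layer input
for n in range(N):
    x = x + pos_emb[n]                      # added before (hypothetical) transformer layer n
    # x = transformer_layer[n](x)           # the actual layer is omitted in this sketch
```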

During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. The evaluation procedure is extremely expensive.

Illustration of the vanilla model with a segment length 4.
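A sketch of this evaluation loop; model is a hypothetical callable that re-encodes the window from scratch and returns the prediction for its last position.

```python
def evaluate(model, tokens, seg_len):
    """Predict each next token by re-encoding a full window from scratch."""
    predictions = []
    for i in range(seg_len, len(tokens)):
        window = tokens[i - seg_len:i]        # shifted right by one position per step
        predictions.append(model(window))     # only the last position's output is kept
    return predictions

# Example with a dummy "model" that just echoes the last token of the window.
print(evaluate(lambda w: w[-1], list("abcdefgh"), seg_len=4))
```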

Transformer-XL

Transformer-XL (meaning extra long) is introduced to address the limitations of using a fixed-length context[4]. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, leading to the ability to model longer-term dependencies and to avoid context fragmentation.

Illustration of the Transformer-XL model with a segment length 4.
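A minimal sketch of the segment-level recurrence with a single toy attention layer (the causal mask and the relative positional encodings discussed below are omitted): hidden states of the previous segment are detached and cached, and the next segment attends over the concatenation of the cached memory and itself. Caching the last M segments instead of one gives the length-M memory extension described below.

```python
import torch

d_model, seg_len = 32, 4
Wq = torch.nn.Linear(d_model, d_model, bias=False)
Wk = torch.nn.Linear(d_model, d_model, bias=False)
Wv = torch.nn.Linear(d_model, d_model, bias=False)

def attend(x, context):
    """Queries come from the current segment, keys/values from memory + segment."""
    q, k, v = Wq(x), Wk(context), Wv(context)
    weights = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)
    return weights @ v

def process_segment(x, memory):
    context = x if memory is None else torch.cat([memory, x], dim=0)
    h = attend(x, context)
    return h, h.detach()                   # cache with gradients stopped

memory = None
for segment in torch.randn(3, seg_len, d_model):   # three consecutive segments
    h, memory = process_segment(segment, memory)
```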

Multi-layer Dependency: the recurrent dependency between the hidden states shifts one layer downwards per-segment, which differs from the same-layer recurrence in conventional RNN-LMs.

Faster Evaluation: during evaluation, the representations from the previous segments can be reused, making evaluation 1,800+ times faster than the vanilla model.

Length-M Memory Extension: We can cache M (more than one) previous segments, and reuse all of them as the extra context when processing the current segment.

Relative Positional Encodings: how can we keep the positional information coherent when we reuse the states? In the traditional approach, the input to the transformer is the element-wise addition of the word embeddings and the absolute positional encodings:
h_{\tau+1} = f(h_{\tau}, E_{s_{\tau+1}}+U_{1:L})
In this way, both E_{s_{\tau+1}} and E_{s_{\tau}} are associated with the same positional encoding U_{1:L}, so the model cannot distinguish the positions of tokens in the two segments.
In order to avoid this failure mode, the fundamental idea is to only encode the relative positional information in the hidden states. The positional encoding gives the model a temporal clue or “bias” about how information should be gathered, i.e., where to attend. Instead of incorporating bias statically into the initial embedding, one can inject the same information into the attention score of each layer.
In the standard Transformer, the attention score between a query at position i and a key at position j (using absolute positional encodings U) decomposes into four terms:
A^{abs}_{i,j} = \underbrace{E_{x_i}^T W_q^T W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^T W_q^T W_k U_j}_{(b)} + \underbrace{U_i^T W_q^T W_k E_{x_j}}_{(c)} + \underbrace{U_i^T W_q^T W_k U_j}_{(d)}

The proposed relative positional encodings:

  • Replace all appearances of the absolute positional embedding U_j for computing key vectors in terms (b) and (d) with its relative counterpart R_{i-j}.
  • Introduce a trainable parameter u \in \mathbb{R}^d to replace the query term U_i^T W_q^T in term (c), and similarly a trainable parameter v \in \mathbb{R}^d in term (d). This suggests that the attentive bias towards different words should remain the same regardless of the query position.
  • Deliberately separate the two weight matrices W_{k,E} and W_{k,R} for producing the content-based key vectors and location-based key vectors respectively.
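Putting these three changes together yields the relative attention score used in [4]:
A^{rel}_{i,j} = \underbrace{E_{x_i}^T W_q^T W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^T W_q^T W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^T W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^T W_{k,R} R_{i-j}}_{(d)}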

To summarize, the computational procedure for an N-layer Transformer-XL with a single attention head is, for n = 1, . . . , N (with h^0_{\tau} := E_{s_{\tau}} the word embedding sequence and SG(\cdot) denoting stop-gradient), following [4]:
\tilde{h}^{n-1}_{\tau} = [SG(m^{n-1}_{\tau}) \circ h^{n-1}_{\tau}]
q^n_{\tau}, k^n_{\tau}, v^n_{\tau} = h^{n-1}_{\tau} {W^n_q}^T, \tilde{h}^{n-1}_{\tau} {W^n_{k,E}}^T, \tilde{h}^{n-1}_{\tau} {W^n_v}^T
A^n_{\tau,i,j} = {q^n_{\tau,i}}^T k^n_{\tau,j} + {q^n_{\tau,i}}^T W^n_{k,R} R_{i-j} + u^T k^n_{\tau,j} + v^T W^n_{k,R} R_{i-j}
a^n_{\tau} = Masked\text{-}Softmax(A^n_{\tau}) v^n_{\tau}
o^n_{\tau} = LayerNorm(Linear(a^n_{\tau}) + h^{n-1}_{\tau})
h^n_{\tau} = Positionwise\text{-}Feed\text{-}Forward(o^n_{\tau})

Summary

In an analytic study[5], it was empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. The model is sensitive to word order in the nearby context, but less so in the long-range context. In addition, it is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away.

References


  1. Hrinchuk, O., Popova, M., & Ginsburg, B. (2020). Correction of Automatic Speech Recognition with Transformer Sequence-to-Sequence Model. ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing, 7074–7078. https://doi.org/10.1109/ICASSP40776.2020.9053051

  2. Heafield, K. (2011). KenLM: Faster and Smaller Language Model Queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187–197. http://kheafield.com/professional/avenue/kenlm.pdf

  3. Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). Character-Level Language Modeling with Deeper Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 3159–3166. https://doi.org/10.1609/aaai.v33i01.33013159

  4. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), 2978–2988. https://doi.org/10.18653/v1/p19-1285

  5. Khandelwal, U., He, H., Qi, P., & Jurafsky, D. (2018). Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018, Long Papers), 284–294. https://doi.org/10.18653/v1/p18-1027
