ASR Systems

Introduction

ASR systems can be categorized into three classes by their output unit:

  • Phoneme
  • Grapheme (grapheme-based pronunciation lexica)
  • Morpheme (morph-based language modeling)

System      Criterion  Output    Network
DeepSpeech  CTC        grapheme  CNN + bi-RNN
Wav2letter  ASG        grapheme  CNN
Jasper      CTC        grapheme  Dense Residual CNN
ESPNet      .          .         .

Output Models

DeepSpeech

DeepSpeech[1] was developed by Baidu and is trained with the CTC criterion.

CTC aims at maximizing the “overall” score of the paths in \mathcal{G}_{ctc} (\theta, T), the graph of all label sequences over T frames that are acceptable for the transcription \theta; for that purpose, it minimizes the Forward score:
CTC(\theta, T) = -\textrm{logadd}_{ \pi \in \mathcal{G}_{ctc} (\theta, T)} \sum_{t=1}^T f_{\pi_t}(x)
where f_{\pi_t}(x) is the network's score for label \pi_t at frame t, and the “logadd” operation, also often called “log-sum-exp”, is defined as \textrm{logadd}(a, b) = \log(\exp(a) + \exp(b)). Replacing \textrm{logadd}(·) with \max(·) turns the Forward algorithm into the Viterbi algorithm.
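
The relation between the Forward and Viterbi scores can be sketched in a few lines of Python. This is only an illustration, not the DeepSpeech implementation: the function names `logadd` and `ctc_path_score`, the choice of index 0 for the blank label, and the toy uniform scores are all assumptions made for the example.

```python
import math

def logadd(a, b):
    """Stable log(exp(a) + exp(b)); -inf acts as the neutral element."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_path_score(scores, labels, blank=0, combine=logadd):
    """Combine the scores of all CTC paths for `labels`.

    scores[t][k] is the log-score of label k at frame t (hypothetical layout).
    combine=logadd gives the Forward score, combine=max the Viterbi score.
    """
    # Interleave blanks around the labels: [blank, l1, blank, l2, ..., blank]
    states = [blank]
    for label in labels:
        states += [label, blank]
    num_states = len(states)

    # A path may start in the leading blank or in the first label.
    alpha = [-math.inf] * num_states
    alpha[0] = scores[0][states[0]]
    if num_states > 1:
        alpha[1] = scores[0][states[1]]

    for t in range(1, len(scores)):
        new = [-math.inf] * num_states
        for s in range(num_states):
            acc = alpha[s]                       # stay in the same state
            if s >= 1:
                acc = combine(acc, alpha[s - 1])  # advance by one state
            # Skip a blank only between two different non-blank labels.
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                acc = combine(acc, alpha[s - 2])
            new[s] = acc + scores[t][states[s]]
        alpha = new

    # A path may end in the last label or in the trailing blank.
    total = alpha[-1]
    if num_states > 1:
        total = combine(total, alpha[-2])
    return total

# Toy input: 3 frames, 3 labels (0 = blank), uniform log-probabilities.
uniform = [[math.log(1 / 3)] * 3 for _ in range(3)]
forward = ctc_path_score(uniform, [1])               # logadd over all paths
viterbi = ctc_path_score(uniform, [1], combine=max)  # best single path
print("CTC loss:", -forward, "Viterbi score:", viterbi)
```

With uniform scores, a single letter over three frames has six valid paths, so the Forward score exceeds the Viterbi score by exactly log 6.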


Comparison of WER on the training and development sets for various depths of RNN, with and without BatchNorm. The number of parameters is kept constant as the depth increases, so the number of hidden units per layer decreases. All networks have 38 million parameters. The architecture “M RNN, N total” means 1 layer of 1D convolution at the input, M consecutive bidirectional RNN layers, and the rest as fully-connected layers.

Wav2letter

Wav2letter[2][3], from Facebook, introduces the “Auto Segmentation Criterion” (ASG). ASG is an alternative to CTC, with three differences:

  1. There are no blank labels, which produces a much simpler graph.
  2. Node scores are un-normalized (and transition scores on the edges may be un-normalized too). This makes it easy to plug in an external language model.
  3. Normalization is global instead of per-frame. This is necessary when using un-normalized scores on nodes or edges; it ensures that incorrect transcriptions get a low confidence.

The CTC criterion graph. (a) Graph which represents all the acceptable sequences of letters (with the blank state denoted “∅”), for the transcription “cat”. (b) Shows the same graph unfolded over 5 frames. There are no transition scores. At each time step, nodes are assigned a conditional probability output by the neural network acoustic model.

Given an unfolded graph \mathcal{G}_{asg} over T frames for a given transcription \theta, as well as a fully connected graph \mathcal{G}_{full} over T frames (representing all possible sequences of letters), ASG minimizes:
\begin{equation} \begin{split} ASG(\theta, T) =& -\textrm{logadd}_{\pi \in \mathcal{G}_{asg} (\theta, T)} \sum_{t=1}^T (f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x)) \\ &+ \textrm{logadd}_{\pi \in \mathcal{G}_{full} (\theta, T)} \sum_{t=1}^T (f_{\pi_t} (x) + g_{ \pi_{t-1},\pi_t}(x)) \\ \end{split} \end{equation}
where g_{i,j}(·) is a transition score model to jump from label i to label j. As for CTC, both parts can be computed efficiently with the Forward algorithm. The first term scores the paths that spell the correct transcription, and the second term, over \mathcal{G}_{full}, is used for normalization. Un-normalized transition scores are possible on the edges. At each time step, nodes are assigned a conditional un-normalized score, output by the neural network acoustic model.
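
A minimal NumPy sketch of the two Forward passes in the ASG criterion follows. It is an illustration under stated assumptions, not the Wav2letter code: the function names are hypothetical, `f` is taken to be a (T, N) array of node scores and `g` an (N, N) array with `g[i, j]` the score for moving from label i to label j, and no transition score is applied at the first frame.

```python
import numpy as np

def full_score(f, g):
    """logadd over every label sequence in the fully connected graph."""
    alpha = f[0].copy()  # assumption: no transition score at the first frame
    for t in range(1, f.shape[0]):
        # alpha[i] + g[i, j], log-summed over the previous label i
        alpha = f[t] + np.logaddexp.reduce(alpha[:, None] + g, axis=0)
    return np.logaddexp.reduce(alpha)

def constrained_score(f, g, labels):
    """logadd over the paths that spell `labels` (stay or advance per frame)."""
    labels = np.asarray(labels)
    alpha = np.full(len(labels), -np.inf)
    alpha[0] = f[0, labels[0]]            # paths must start at the first letter
    stay = g[labels, labels]              # self-loop score g[l_s, l_s]
    move = g[labels[:-1], labels[1:]]     # score g[l_{s-1}, l_s] to advance
    for t in range(1, f.shape[0]):
        new = alpha + stay
        new[1:] = np.logaddexp(new[1:], alpha[:-1] + move)
        alpha = new + f[t, labels]
    return alpha[-1]                      # paths must end at the last letter

def asg_loss(f, g, labels):
    """Negative constrained score plus the global normalization term."""
    return -constrained_score(f, g, labels) + full_score(f, g)

# Toy input: 6 frames, 4 labels, random un-normalized scores.
rng = np.random.default_rng(0)
f = rng.normal(size=(6, 4))
g = rng.normal(size=(4, 4))
print("ASG loss:", asg_loss(f, g, [1, 3, 0]))
```

Because every path through the constrained graph is also a path through the full graph, the loss computed this way is never negative, whatever the scale of the un-normalized scores.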

The ASG criterion graph. (a) Graph which represents all the acceptable sequences of letters for the transcription “cat”. (b) Shows the same graph unfolded over 5 frames. (c) Shows the corresponding fully connected graph, which describes all possible sequences of letters; this graph is used for normalization purposes. Un-normalized transition scores are possible on the edges. At each time step, nodes are assigned a conditional un-normalized score, output by the neural network acoustic model.
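
To make the “much simpler graph” claim concrete, the sketch below counts the states and the valid paths for the transcription “cat” unfolded over 5 frames, with and without blank states. The helper names `ctc_states` and `count_paths` and the `"_"` blank symbol are assumptions made for this example only.

```python
def ctc_states(word, blank="_"):
    """Interleave blanks around the letters: _ c _ a _ t _"""
    states = [blank]
    for letter in word:
        states += [letter, blank]
    return states

def count_paths(states, num_frames, blank=None):
    """Count left-to-right paths through `states` unfolded over num_frames.

    With `blank` set, CTC rules apply: a path may start in either of the
    first two states, end in either of the last two, and skip a blank that
    sits between two different letters. Without it, a path starts in the
    first state, ends in the last, and either stays or advances each frame.
    """
    n = len(states)
    counts = [0] * n
    counts[0] = 1
    if blank is not None and n > 1:
        counts[1] = 1
    for _ in range(num_frames - 1):
        new = [0] * n
        for s in range(n):
            new[s] = counts[s]                   # stay
            if s >= 1:
                new[s] += counts[s - 1]          # advance by one state
            if (blank is not None and s >= 2
                    and states[s] != blank and states[s] != states[s - 2]):
                new[s] += counts[s - 2]          # skip over a blank
        counts = new
    if blank is not None and n > 1:
        return counts[-1] + counts[-2]
    return counts[-1]

asg = list("cat")
ctc = ctc_states("cat")
print("ASG:", len(asg), "states,", count_paths(asg, 5), "paths")
print("CTC:", len(ctc), "states,", count_paths(ctc, 5, blank="_"), "paths")
```

For this word the blank-free graph has 3 states and 6 paths over 5 frames, against 7 states and 28 paths once blanks are added.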

Jasper

Jasper[4] is by NVIDIA. The deepest Jasper variant uses 54 convolutional layers.

ESPNet

[5]

Kaldi

References


  1. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., … Zhu, Z. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. 33rd International Conference on Machine Learning, ICML 2016, 1, 312–321.

  2. Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. 1–8. http://arxiv.org/abs/1609.03193

  3. Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchinsky, V., & Collobert, R. (2019). Wav2Letter++: A Fast Open-source Speech Recognition System. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2019-May, 6460–6464. https://doi.org/10.1109/ICASSP.2019.8683535

  4. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., & Gadde, R. T. (2019). Jasper: An end-to-end convolutional neural acoustic model. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-Septe, 71–75. https://doi.org/10.21437/Interspeech.2019-1819

  5. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., & Ochiai, T. (2018). ESPNet: End-to-end speech processing toolkit. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018-Septe, 2207–2211. https://doi.org/10.21437/Interspeech.2018-1456
