Introduction
ASR systems can be categorized into three classes by their output units:
- Phoneme
- Grapheme (grapheme-based pronunciation lexica)
- Morpheme (morph-based language modeling)
System | Criterion | Output | Network
---|---|---|---
DeepSpeech | CTC | grapheme | CNN + bi-RNN
Wav2letter | ASG | grapheme | CNN
Jasper | CTC | grapheme | dense residual CNN
ESPNet | hybrid CTC + attention | grapheme | RNN encoder-decoder
DeepSpeech
DeepSpeech [1] was developed by Baidu and is trained with the CTC criterion.
CTC aims at maximizing the "overall" score of paths in its graph $\mathcal{G}_{ctc}(\theta, T)$; for that purpose, it minimizes the Forward score:

$$\mathrm{CTC}(\theta, T) = -\operatorname*{logadd}_{\pi \in \mathcal{G}_{ctc}(\theta, T)} \sum_{t=1}^{T} f_{\pi_t}(x),$$

where the "logadd" operation, also often called "log-sum-exp", is defined as $\operatorname{logadd}(a, b) = \log(\exp(a) + \exp(b))$. By replacing the $\operatorname{logadd}$ with a $\max$, the Forward algorithm turns into the Viterbi algorithm.
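To make the logadd/max relationship concrete, here is a minimal NumPy sketch (mine, not from [1]) of a Forward recursion over a simplified lattice, where a path may stay on the current label or advance to the next one; swapping `logadd` for `max` yields the Viterbi score. Blank handling and the CTC skip transitions are omitted.

```python
import numpy as np

def forward_score(log_probs, use_max=False):
    """Score all monotonic alignments of a label sequence to T frames.

    log_probs: (T, L) array, log_probs[t, i] = per-frame log score f of
    emitting the i-th label of the (expanded) sequence at time t.
    At each frame a path may stay on label i or advance to label i + 1.
    """
    T, L = log_probs.shape
    combine = np.maximum if use_max else np.logaddexp  # max -> Viterbi
    alpha = np.full(L, -np.inf)
    alpha[0] = log_probs[0, 0]          # paths must start on the first label
    for t in range(1, T):
        # stay on the same label, or arrive from the previous one
        shifted = np.concatenate(([-np.inf], alpha[:-1]))
        alpha = combine(alpha, shifted) + log_probs[t]
    return alpha[-1]                    # paths must end on the last label

scores = np.log(np.random.dirichlet(np.ones(4), size=10))  # T=10, L=4
print("Forward (logadd):", forward_score(scores))
print("Viterbi (max):   ", forward_score(scores, use_max=True))
```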
All the DeepSpeech network variants compared in [1] have 38 million parameters.
Wav2letter
Wav2letter [2][3] was developed by Facebook and introduces the "Auto Segmentation Criterion" (ASG). ASG is an alternative to CTC, with three differences:
- there are no blank labels, which produces a much simpler graph;
- scores on the nodes are un-normalized (and transition scores on the edges may be un-normalized too), which makes it easy to plug in an external language model;
- normalization is global instead of per-frame; this is necessary when using un-normalized scores on nodes or edges, and ensures incorrect transcriptions get low confidence.
ASG considers an unfolded graph $\mathcal{G}_{asg}(\theta, T)$ over $T$ frames for a given transcription $\theta$, as well as a fully connected graph $\mathcal{G}_{full}(\theta, T)$ over $T$ frames (representing all possible sequences of letters). The criterion contrasts the two:

$$\mathrm{ASG}(\theta, T) = -\operatorname*{logadd}_{\pi \in \mathcal{G}_{asg}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x) \right) + \operatorname*{logadd}_{\pi \in \mathcal{G}_{full}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x) \right),$$

where $g_{i,j}(\cdot)$ is a transition score model to jump from label $i$ to label $j$. As with CTC, these two parts can be computed efficiently with the Forward algorithm. The graph $\mathcal{G}_{full}$ is used for normalization purposes. Un-normalized transition scores are possible on the edges, and at each time step, nodes are assigned a conditional un-normalized score $f_{\pi_t}(x)$, output by the neural network acoustic model.
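As an illustration, here is a minimal NumPy sketch (mine, not the wav2letter implementation, which among other things handles letter repetitions with special tokens) of how ASG can be assembled from two Forward passes, one over the unfolded transcription graph and one over the fully connected graph, both mixing emission scores `f` with transition scores `g`:

```python
import numpy as np

def _full_forward(emissions, transitions):
    """logadd over all paths in the fully connected graph G_full.

    emissions:   (T, L) un-normalized node scores f.
    transitions: (L, L) un-normalized edge scores, g[i, j] = score of i -> j.
    """
    alpha = emissions[0].copy()
    for f_t in emissions[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + f_t
    return np.logaddexp.reduce(alpha)

def asg_loss(emissions, transitions, target):
    """ASG = -score(G_asg) + score(G_full), both via the Forward algorithm."""
    T, _ = emissions.shape
    target = np.asarray(target)
    N = len(target)
    # Forward over the unfolded transcription graph G_asg: at each frame a
    # path may stay on target[n] or advance to target[n + 1].
    alpha = np.full(N, -np.inf)
    alpha[0] = emissions[0, target[0]]
    for t in range(1, T):
        stay = alpha + transitions[target, target]
        move = np.concatenate(
            ([-np.inf], alpha[:-1] + transitions[target[:-1], target[1:]]))
        alpha = np.logaddexp(stay, move) + emissions[t, target]
    return -alpha[-1] + _full_forward(emissions, transitions)

emissions = np.random.randn(20, 5)     # T = 20 frames, L = 5 letters
transitions = np.random.randn(5, 5)    # learned jointly with the network
print(asg_loss(emissions, transitions, target=[2, 0, 3]))
```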
Jasper
Jasper [4] is by NVIDIA. A Jasper BxR model stacks B blocks, each containing R sub-blocks of (1D convolution, batch norm, ReLU, dropout), with residual connections between blocks; the deepest variant, Jasper 10x5 with dense residual connections, uses 54 convolutional layers.
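As a rough sketch (after the paper's description in [4], not NVIDIA's code), a single Jasper sub-block in PyTorch could look like this; B blocks of R such sub-blocks, plus prologue and epilogue convolutions, give the full model:

```python
import torch
from torch import nn

class JasperSubBlock(nn.Module):
    """One Jasper sub-block: 1D conv -> batch norm -> ReLU -> dropout.

    In the full model, residual connections from earlier blocks are summed
    in before the activation of a block's last sub-block.
    """
    def __init__(self, in_ch, out_ch, kernel, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x, residual=None):
        y = self.bn(self.conv(x))
        if residual is not None:
            y = y + residual
        return self.act(y)

x = torch.randn(8, 64, 200)            # (batch, features, frames)
sub = JasperSubBlock(64, 64, kernel=11)
print(sub(x, residual=x).shape)        # torch.Size([8, 64, 200])
```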
ESPNet
ESPNet [5] is an end-to-end speech processing toolkit whose recipes are built around hybrid CTC/attention models.
Kaldi
References
1. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., … Zhu, Z. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. 33rd International Conference on Machine Learning, ICML 2016, 1, 312–321.
2. Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: An end-to-end ConvNet-based speech recognition system. http://arxiv.org/abs/1609.03193
3. Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchinsky, V., & Collobert, R. (2019). Wav2Letter++: A fast open-source speech recognition system. ICASSP 2019, 6460–6464. https://doi.org/10.1109/ICASSP.2019.8683535
4. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., & Gadde, R. T. (2019). Jasper: An end-to-end convolutional neural acoustic model. INTERSPEECH 2019, 71–75. https://doi.org/10.21437/Interspeech.2019-1819
5. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., & Ochiai, T. (2018). ESPNet: End-to-end speech processing toolkit. INTERSPEECH 2018, 2207–2211. https://doi.org/10.21437/Interspeech.2018-1456