Introduction
ASR systems can be categorized into three classes by their output units:
- Phoneme
- Grapheme (grapheme-based pronunciation lexica)
- Morpheme (morph-based language modeling)
System | Criterion | Output | Network
---|---|---|---
DeepSpeech | CTC | grapheme | CNN + bi-RNN
Wav2letter | ASG | grapheme | CNN
Jasper | CTC | grapheme | dense residual CNN
ESPNet | hybrid CTC + attention | grapheme | RNN encoder-decoder
DeepSpeech
DeepSpeech [1] was developed by Baidu and is trained with the CTC criterion.
CTC aims at maximizing the "overall" score of paths in its graph $\mathcal{G}_{ctc}(\theta, T)$; for that purpose, it minimizes the Forward score:

$$\mathrm{CTC}(\theta, T) = -\operatorname*{logadd}_{\pi \in \mathcal{G}_{ctc}(\theta, T)} \sum_{t=1}^{T} f_{\pi_t}(x),$$

where the "logadd" operation, also often called "log-sum-exp", is defined as $\operatorname{logadd}(a, b) = \log(\exp(a) + \exp(b))$. By replacing the $\operatorname{logadd}$ with a $\max$, the Forward algorithm turns into the Viterbi algorithm.
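To make the logadd/max relationship concrete, here is a minimal NumPy sketch (mine, not from [1]) of a Forward recursion over a simplified lattice, where a path may stay on the current label or advance to the next one; swapping `logadd` for `max` yields the Viterbi score. Blank handling and the CTC skip transitions are omitted.

```python
import numpy as np

def forward_score(log_probs, use_max=False):
    """Score all monotonic alignments of a label sequence to T frames.

    log_probs: (T, L) array, log_probs[t, i] = per-frame log score f of
    emitting the i-th label of the (expanded) sequence at time t.
    At each frame a path may stay on label i or advance to label i + 1.
    """
    T, L = log_probs.shape
    combine = np.maximum if use_max else np.logaddexp  # max -> Viterbi
    alpha = np.full(L, -np.inf)
    alpha[0] = log_probs[0, 0]          # paths must start on the first label
    for t in range(1, T):
        # stay on the same label, or arrive from the previous one
        shifted = np.concatenate(([-np.inf], alpha[:-1]))
        alpha = combine(alpha, shifted) + log_probs[t]
    return alpha[-1]                    # paths must end on the last label

scores = np.log(np.random.dirichlet(np.ones(4), size=10))  # T=10, L=4
print("Forward (logadd):", forward_score(scores))
print("Viterbi (max):   ", forward_score(scores, use_max=True))
```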
All the DeepSpeech network variants compared in [1] have 38 million parameters.
Wav2letter
Wav2letter [2][3] was developed by Facebook and introduces the "Auto Segmentation Criterion" (ASG). ASG is an alternative to CTC, with three differences:
- there are no blank labels, which produces a much simpler graph;
- scores on the nodes are un-normalized (and transition scores on the edges may be un-normalized too), which makes it easy to plug in an external language model;
- normalization is global instead of per-frame; this is necessary when using un-normalized scores on nodes or edges, and ensures incorrect transcriptions get low confidence.
ASG considers an unfolded graph $\mathcal{G}_{asg}(\theta, T)$ over $T$ frames for a given transcription $\theta$, as well as a fully connected graph $\mathcal{G}_{full}(\theta, T)$ over $T$ frames (representing all possible sequences of letters). The criterion contrasts the two:

$$\mathrm{ASG}(\theta, T) = -\operatorname*{logadd}_{\pi \in \mathcal{G}_{asg}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x) \right) + \operatorname*{logadd}_{\pi \in \mathcal{G}_{full}(\theta, T)} \sum_{t=1}^{T} \left( f_{\pi_t}(x) + g_{\pi_{t-1},\pi_t}(x) \right),$$

where $g_{i,j}(\cdot)$ is a transition score model to jump from label $i$ to label $j$. As with CTC, these two parts can be computed efficiently with the Forward algorithm. The graph $\mathcal{G}_{full}$ is used for normalization purposes. Un-normalized transition scores are possible on the edges, and at each time step, nodes are assigned a conditional un-normalized score $f_{\pi_t}(x)$, output by the neural network acoustic model.
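As an illustration, here is a minimal NumPy sketch (mine, not the wav2letter implementation, which among other things handles letter repetitions with special tokens) of how ASG can be assembled from two Forward passes, one over the unfolded transcription graph and one over the fully connected graph, both mixing emission scores `f` with transition scores `g`:

```python
import numpy as np

def _full_forward(emissions, transitions):
    """logadd over all paths in the fully connected graph G_full.

    emissions:   (T, L) un-normalized node scores f.
    transitions: (L, L) un-normalized edge scores, g[i, j] = score of i -> j.
    """
    alpha = emissions[0].copy()
    for f_t in emissions[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + f_t
    return np.logaddexp.reduce(alpha)

def asg_loss(emissions, transitions, target):
    """ASG = -score(G_asg) + score(G_full), both via the Forward algorithm."""
    T, _ = emissions.shape
    target = np.asarray(target)
    N = len(target)
    # Forward over the unfolded transcription graph G_asg: at each frame a
    # path may stay on target[n] or advance to target[n + 1].
    alpha = np.full(N, -np.inf)
    alpha[0] = emissions[0, target[0]]
    for t in range(1, T):
        stay = alpha + transitions[target, target]
        move = np.concatenate(
            ([-np.inf], alpha[:-1] + transitions[target[:-1], target[1:]]))
        alpha = np.logaddexp(stay, move) + emissions[t, target]
    return -alpha[-1] + _full_forward(emissions, transitions)

emissions = np.random.randn(20, 5)     # T = 20 frames, L = 5 letters
transitions = np.random.randn(5, 5)    # learned jointly with the network
print(asg_loss(emissions, transitions, target=[2, 0, 3]))
```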
Jasper
Jasper [4] is by NVIDIA. A Jasper BxR model stacks B blocks, each containing R sub-blocks of (1D convolution, batch norm, ReLU, dropout), with residual connections between blocks; the deepest variant, Jasper 10x5 with dense residual connections, uses 54 convolutional layers.
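As a rough sketch (after the paper's description in [4], not NVIDIA's code), a single Jasper sub-block in PyTorch could look like this; B blocks of R such sub-blocks, plus prologue and epilogue convolutions, give the full model:

```python
import torch
from torch import nn

class JasperSubBlock(nn.Module):
    """One Jasper sub-block: 1D conv -> batch norm -> ReLU -> dropout.

    In the full model, residual connections from earlier blocks are summed
    in before the activation of a block's last sub-block.
    """
    def __init__(self, in_ch, out_ch, kernel, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x, residual=None):
        y = self.bn(self.conv(x))
        if residual is not None:
            y = y + residual
        return self.act(y)

x = torch.randn(8, 64, 200)            # (batch, features, frames)
sub = JasperSubBlock(64, 64, kernel=11)
print(sub(x, residual=x).shape)        # torch.Size([8, 64, 200])
```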
ESPNet
ESPNet [5] is an end-to-end speech processing toolkit whose recipes are built around hybrid CTC/attention models.
Kaldi
References
1. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., … Zhu, Z. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. 33rd International Conference on Machine Learning, ICML 2016, 1, 312–321.
2. Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: An end-to-end ConvNet-based speech recognition system. http://arxiv.org/abs/1609.03193
3. Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchinsky, V., & Collobert, R. (2019). Wav2Letter++: A fast open-source speech recognition system. ICASSP 2019, 6460–6464. https://doi.org/10.1109/ICASSP.2019.8683535
4. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., & Gadde, R. T. (2019). Jasper: An end-to-end convolutional neural acoustic model. INTERSPEECH 2019, 71–75. https://doi.org/10.21437/Interspeech.2019-1819
5. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., & Ochiai, T. (2018). ESPNet: End-to-end speech processing toolkit. INTERSPEECH 2018, 2207–2211. https://doi.org/10.21437/Interspeech.2018-1456