A product team is selecting an architecture for the speech-to-text component of a battery-powered wearable that will later run a neural text-to-speech engine. The chosen automatic speech recognition (ASR) model must satisfy several strict constraints: it needs to deliver partial transcriptions with a streaming latency under 300 ms; it must operate without a separate external language model due to severe memory limits; and it should maximize code and parameter sharing with the future TTS system.
Which approach best satisfies all of these requirements?
A. Listen-Attend-Spell (LAS) encoder-decoder with global attention
B. Recurrent Neural Network Transducer (RNN-T) trained end-to-end with transducer loss
C. Tacotron 2 sequence-to-sequence neural TTS model
D. Hybrid DNN/HMM system decoded with a weighted finite-state transducer and external n-gram language model
The RNN-Transducer (RNN-T) family was designed for real-time, left-to-right decoding. It jointly trains three components with the transducer loss: an encoder (acoustic model), a prediction network (an internal language model over previously emitted tokens), and a joint network that fuses the two. Because the language-modeling component is built into the network, no external n-gram or neural LM is needed, which keeps the memory footprint low. RNN-T models have been deployed entirely on mobile and wearable-class hardware, emitting partial hypotheses as new frames arrive. Recent work also applies the same transducer loss to text-to-speech (e.g., TTS-Transducer), so the same framework can later be reused for generation, fulfilling the code-sharing goal.
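To make the three-component structure concrete, here is a minimal sketch of a transducer in PyTorch trained with torchaudio's rnnt_loss. The class name, module sizes, vocabulary, and single-layer LSTMs are illustrative assumptions, not the configuration of any particular production system.

```python
# Hedged sketch of an RNN-T: encoder + prediction network + joint network,
# trained with the transducer loss. All sizes here are assumed for illustration.
import torch
import torch.nn as nn
from torchaudio.functional import rnnt_loss

VOCAB = 32           # assumed vocabulary size, with blank at index 0
FEAT, HID = 80, 256  # assumed feature and hidden dimensions

class Transducer(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: acoustic model, processes audio frames left to right.
        self.encoder = nn.LSTM(FEAT, HID, batch_first=True)
        # Prediction network: internal LM over previously emitted tokens.
        self.embed = nn.Embedding(VOCAB, HID)
        self.predictor = nn.LSTM(HID, HID, batch_first=True)
        # Joint network: fuses both streams into per-(frame, token) logits.
        self.joint = nn.Sequential(nn.Tanh(), nn.Linear(HID, VOCAB))

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                  # (B, T, HID)
        pred, _ = self.predictor(self.embed(tokens))  # (B, U+1, HID)
        # Broadcast-add to form the (B, T, U+1, VOCAB) joint lattice.
        return self.joint(enc.unsqueeze(2) + pred.unsqueeze(1))

model = Transducer()
feats = torch.randn(1, 50, FEAT)  # 50 audio frames of assumed features
targets = torch.randint(1, VOCAB, (1, 10), dtype=torch.int32)
# Prediction-network input is the target sequence prefixed with blank.
tokens = torch.cat([torch.zeros(1, 1, dtype=torch.long), targets.long()], dim=1)
logits = model(feats, tokens)     # (1, 50, 11, VOCAB)
loss = rnnt_loss(logits, targets,
                 torch.tensor([50], dtype=torch.int32),
                 torch.tensor([10], dtype=torch.int32),
                 blank=0)
```

Note how the loss marginalizes over all monotonic alignments of the target tokens to the audio frames, which is what lets the model decode strictly left to right without an external alignment or language model.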
LAS attends globally over the full encoded utterance, so it cannot emit stable partial transcriptions within a 300 ms budget, and buffering entire utterances raises memory use. Hybrid DNN/HMM systems require a separate WFST decoding graph and an external n-gram language model, violating the memory constraint. Tacotron 2 is a TTS-only model and does not perform speech recognition at all. The RNN-Transducer is therefore the only option that meets every listed constraint.
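The streaming contrast with LAS can be illustrated directly. Below is a hedged sketch of greedy (argmax) RNN-T decoding, building on the assumed Transducer class above: the hypothesis is extended one frame at a time, so a partial transcription exists after every frame. The max_symbols cap is an assumed safeguard against degenerate non-blank loops; real decoders also use beam search and cached states.

```python
# Greedy RNN-T decoding sketch: emits tokens frame by frame, so partial
# hypotheses are available as audio arrives (no full-utterance attention).
@torch.no_grad()
def greedy_decode(model, feats, blank=0, max_symbols=5):
    enc, _ = model.encoder(feats)            # (1, T, HID)
    hyp = [blank]                            # seed with the blank token
    pred_state = None
    for t in range(enc.size(1)):             # consume one frame at a time
        for _ in range(max_symbols):         # cap symbols per frame
            last = torch.tensor([[hyp[-1]]])
            pred, state = model.predictor(model.embed(last), pred_state)
            logits = model.joint(enc[:, t:t+1].unsqueeze(2) + pred.unsqueeze(1))
            k = int(logits.argmax(dim=-1))
            if k == blank:                   # blank: advance to the next frame
                break
            hyp.append(k)                    # non-blank: emit, update LM state
            pred_state = state
    return hyp[1:]                           # partial hypothesis so far

print(greedy_decode(model, feats))  # untrained model, so output is arbitrary
```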