Abstract

Recurrent neural network transducers (RNN-T) are a promising end-to-end speech recognition framework that transduces input acoustic frames into a character sequence. Best-first and breadth-first searches have been used as decoding strategies for RNN-T. However, best-first search expands hypotheses sequentially, which slows down decoding. Breadth-first search replaces this sequential process with a parallel one, but it conducts an expansion search at every decoding step, which is unnecessary. Because the character sequence is much shorter than the sequence of decoding frames, most decoding frames correspond to a blank symbol, so expanding at every step incurs computational overhead. To address these limitations, we introduce adaptive expansion search (AES) to accelerate RNN-T inference. AES batches the hypotheses and adopts a decision-making process that determines whether to continue the expansion search, thereby avoiding unnecessary expansions. Furthermore, pruning is applied to AES for further acceleration. We achieved a significant speedup and a lower word error rate compared with the baselines.
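
The idea of batching hypotheses and deciding per frame whether to keep expanding can be illustrated with a minimal sketch. This is not the AES algorithm from the paper: the stopping rule used here (stop once no non-blank candidate can still beat the weakest blank-terminated hypothesis in the beam), the function names `adaptive_decode` and `toy_joint`, the parameters `beam_size` and `max_expansions`, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

BLANK = "<blank>"
Prefix = Tuple[str, ...]
Hyp = Tuple[Prefix, float]  # (emitted labels, accumulated log-probability)
JointFn = Callable[[int, Prefix], Dict[str, float]]


def adaptive_decode(frames: Sequence[int], joint: JointFn,
                    beam_size: int = 1,
                    max_expansions: int = 3) -> Tuple[List[Hyp], int]:
    """Toy frame-synchronous beam search with an early-stopping expansion loop.

    Returns the final beam and the number of batched joint evaluations.
    """
    beam: List[Hyp] = [((), 0.0)]
    joint_batches = 0
    for frame in frames:
        finished: Dict[Prefix, float] = {}  # hypotheses that emitted blank here
        active = beam
        for _ in range(max_expansions):
            # Batched step: score all active hypotheses for this frame together.
            dists = [joint(frame, prefix) for prefix, _ in active]
            joint_batches += 1
            candidates: List[Hyp] = []
            for (prefix, score), dist in zip(active, dists):
                blank_score = score + dist[BLANK]
                # Simplification: keep the best path per prefix instead of
                # merging duplicate prefixes with log-sum-exp.
                finished[prefix] = max(finished.get(prefix, -math.inf), blank_score)
                for symbol, logp in dist.items():
                    if symbol != BLANK:
                        candidates.append((prefix + (symbol,), score + logp))
            # Assumed decision rule: log-probabilities only decrease as labels
            # are added, so a candidate at or below the weakest kept finished
            # hypothesis cannot enter the beam for this frame.
            kept = sorted(finished.values(), reverse=True)[:beam_size]
            threshold = kept[-1] if len(kept) == beam_size else -math.inf
            candidates.sort(key=lambda h: h[1], reverse=True)
            active = [h for h in candidates[:beam_size] if h[1] > threshold]
            if not active:
                break  # nothing worth expanding: move on to the next frame
            # Candidates still active after max_expansions are dropped.
        beam = sorted(finished.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
    return beam, joint_batches


def toy_joint(frame: int, prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for the joint network: favors "a" only at frame 2."""
    if frame == 2 and len(prefix) == 0:
        return {BLANK: math.log(0.1), "a": math.log(0.9)}
    return {BLANK: math.log(0.9), "a": math.log(0.1)}


if __name__ == "__main__":
    beam, batches = adaptive_decode([0, 1, 2, 3], toy_joint)
    print(beam[0][0], batches)
```

On this toy input, three of the four frames are blank-dominated, so the loop stops after one batched evaluation on each of them and runs 5 evaluations in total, whereas a schedule that always performs three expansions per frame would run 12.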

SPEECH/AUDIO

Accelerating RNN Transducer Inference via Adaptive Expansion Search

IEEE Signal Processing Letters

2020-11-06
