Improving End-to-End Contextual Speech Recognition via a Word-Matching Algorithm with Backward Search

김준태(카카오엔터프라이즈), 이윤한(카카오엔터프라이즈)

IEEE Signal Processing Letters



End-to-end automatic speech recognition (E2E-ASR) prefers the common words during training rather than rare ones related to contextual information such as song names. Thus, recognizing contextual information correctly is a hurdle for E2E-ASR to reach the production-level. To overcome the limitations of E2E-ASR in recognizing contextual information, this work presents a post-processing followed by E2E-ASR in an algorithmic way, referred to as a word-matching algorithm with backward search (WMA-BS). At first, we allow E2E-ASR to roughly detect the position of target words that has similar pronunciation with desired contextual phrases. After that, given the hypothesis from E2E-ASR with the rough position of target words, WMA-BS estimates the correct target words and decides whether to replace the target words with the contextual phrase or not, according to their phonetic and literal similarity. Applying the proposed method to E2E-ASR achieved relative improvement up to 52.7% in word error rate across several harsh conditions.