This paper proposes an iterative retrieval-augmented generation (RAG) approach to improve spoken language understanding (SLU). First, speech retrieval over the training set is performed using a pretrained automatic speech recognition encoder. The texts and intent labels of the retrieved utterances are then formulated as prompts to guide the SLU decoder, with an added prompt attention mechanism to strengthen attention between the generated output and the prompts. Retrieval and generation proceed iteratively for up to three rounds, exiting early if retrieval similarity scores stop improving. Experiments demonstrate that the proposed RAG approach substantially outperforms conventional end-to-end and cascaded SLU models on intent prediction from speech, highlighting the efficacy of incorporating relevant external knowledge through retrieval-based prompting. The iterative process allows progressive refinement of predictions. Overall, this work shows promise for advancing SLU via iterative RAG.
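The retrieve-then-generate loop with a similarity-based early exit can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings, the `generate` callable (assumed here to return an intent prediction plus a refined query embedding), and the cosine-similarity retrieval criterion are all simplifying assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, train_embs, k=3):
    # rank training utterances by similarity to the query embedding;
    # return the top-k indices and the best similarity score
    sims = [cosine_sim(query_emb, e) for e in train_embs]
    top = sorted(range(len(train_embs)), key=lambda i: sims[i], reverse=True)[:k]
    return top, max(sims)

def iterative_rag(query_emb, train_embs, generate, max_iters=3):
    """Run up to max_iters retrieve-then-generate rounds.

    `generate` is a hypothetical decoder call: given the current query
    embedding and retrieved neighbor indices, it returns an intent
    prediction and a (possibly refined) query embedding for the next round.
    The loop exits early when the best retrieval similarity stops improving.
    """
    best_sim = float("-inf")
    prediction = None
    for _ in range(max_iters):
        idxs, sim = retrieve(query_emb, train_embs)
        if sim <= best_sim:
            break  # early exit: similarity did not improve
        best_sim = sim
        prediction, query_emb = generate(query_emb, idxs)
    return prediction
```

In this sketch the early-exit test compares only the best neighbor's score across rounds; a real system might instead track an aggregate score over all retrieved prompts.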
Wenhan Han, Xiao Xiao, Eric Jakobsson, Jun Wang, Mykola Pechenizkiy, Meng Fang