JOURNAL ARTICLE

Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training

Abstract

Modern spoken language understanding (SLU) approaches optimize the system in an end-to-end (E2E) manner. This approach offers two key advantages. First, it helps mitigate error propagation from upstream systems. Second, it makes it straightforward to combine various types of information and optimize them toward the same objective. In this study, we build an SLU system that integrates information from two modalities, i.e., speech and text, and concurrently optimizes the associated tasks. We leverage a pre-trained model built with speech and text data and fine-tune it for E2E SLU tasks. The SLU model is jointly optimized with automatic speech recognition (ASR) and SLU objectives under single-mode and dual-mode schemes. In the single-mode model, ASR and SLU results are predicted sequentially, whereas the dual-mode model predicts either ASR or SLU outputs depending on a task tag. Our proposed method demonstrates its superiority through benchmarking against the FSC, SLURP, and in-house datasets, exhibiting improved intent accuracy, SLU-F1, and Word Error Rate (WER).
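The two decoding schemes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tag and separator token names (`<asr>`, `<slu>`, `<sep>`) and the helper functions are hypothetical, and a real model would emit these as decoder token sequences rather than strings.

```python
# Illustrative sketch of single-mode vs. dual-mode output handling.
# Token names and helpers are assumptions, not taken from the paper.

ASR_TAG, SLU_TAG, SEP = "<asr>", "<slu>", "<sep>"

def parse_single_mode(output: str) -> tuple[str, str]:
    """Single-mode: one decoder pass emits the ASR transcript and the
    SLU result sequentially, joined by a separator token."""
    transcript, slu = output.split(SEP)
    return transcript.strip(), slu.strip()

def decode_dual_mode(task_tag: str, hypotheses: dict) -> str:
    """Dual-mode: the task tag fed to the decoder selects whether the
    model produces the ASR transcript or the SLU output."""
    if task_tag == ASR_TAG:
        return hypotheses["transcript"]
    if task_tag == SLU_TAG:
        return hypotheses["intent"]
    raise ValueError(f"unknown task tag: {task_tag}")
```

In the single-mode scheme one forward pass yields both outputs, while the dual-mode scheme requires one pass per task but lets either task be queried independently at inference time.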

Keywords:
Spoken language understanding; Automatic speech recognition; End-to-end training; Speech-to-text pre-training; Word error rate; Language model; Acoustic model; Natural language processing; Speech processing

Metrics

Cited By: 2
FWCI (Field-Weighted Citation Impact): 1.28
References: 28
Citation Normalized Percentile: 0.75

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence