With the widespread application of deep learning methods, multimodal techniques have also developed rapidly. Because single-modal speech recognition suffers reduced accuracy in noisy environments, multimodal fusion recognition is gradually replacing traditional single-modal methods. In this paper, we first enhance and pre-process the audio and video data, then use LSTM recurrent neural networks to extract deep features from the audio and video streams, which effectively mitigates the long-term forgetting problem of ordinary neural networks. The audio and video feature vectors are then fused by a fully connected neural network with linear connections. Compared with speech recognition alone, this audiovisual fusion method achieves better recognition under noise interference; compared with traditional audiovisual recognition methods, it simplifies the recognition pipeline. Experiments on the LRS2-BBC dataset show that the proposed method improves recognition accuracy somewhat over other methods in a clean environment and substantially in noisy conditions.
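The pipeline described above (per-modality LSTM feature extraction followed by linear fusion) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the framework (PyTorch), the class name `AVFusionNet`, and all dimensions (40-dim audio features, 512-dim video features, 256 hidden units, 500 output classes) are assumptions chosen only to make the example runnable.

```python
import torch
import torch.nn as nn

class AVFusionNet(nn.Module):
    """Hypothetical sketch: one LSTM per modality extracts a deep feature
    vector; a fully connected (linear) layer fuses the two vectors."""

    def __init__(self, audio_dim=40, video_dim=512, hidden=256, n_classes=500):
        super().__init__()
        # Separate LSTM encoders for the audio and video streams
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.video_lstm = nn.LSTM(video_dim, hidden, batch_first=True)
        # Linear fusion of the concatenated per-modality features
        self.fusion = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, video):
        # Use each LSTM's final hidden state as that modality's feature vector
        _, (h_audio, _) = self.audio_lstm(audio)
        _, (h_video, _) = self.video_lstm(video)
        fused = torch.cat([h_audio[-1], h_video[-1]], dim=-1)
        return self.fusion(fused)

model = AVFusionNet()
# Batch of 2 sequences, 75 time steps each (assumed shapes)
logits = model(torch.randn(2, 75, 40), torch.randn(2, 75, 512))
print(logits.shape)  # torch.Size([2, 500])
```

Taking only the final hidden state is one common way to reduce a variable-length stream to a fixed feature vector before fusion; the paper does not specify this detail, so other pooling choices are equally plausible.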