JOURNAL ARTICLE

Multimodal fusion for audio-image and video action recognition

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar

Year: 2024   Journal: Neural Computing and Applications   Vol: 36 (10)   Pages: 5499-5513   Publisher: Springer Science+Business Media

Abstract

Multimodal Human Action Recognition (MHAR) is an important research topic in the fields of computer vision and event recognition. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information from image representations of audio signals and spatial information from the video modality using Convolutional Neural Network (CNN)-based feature extractors, and we fuse these features to recognize the respective action classes. We apply a high-level weights assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of the audio and video feature maps to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves an accuracy of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n .
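The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the CNN extractors or the weights assignment algorithm, so the example assumes that the audio-image and video features have already been extracted as plain vectors, and it substitutes a simple weighted concatenation followed by a linear softmax classifier. The names `fuse_features`, `classify`, and `audio_weight`, as well as all dimensions and values, are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(audio_feat, video_feat, audio_weight=0.5):
    """Scale each modality by its weight, then concatenate.

    ASSUMPTION: weighted concatenation is a stand-in for the paper's
    (unspecified) fusion and weights assignment scheme.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    scaled_audio = [audio_weight * a for a in audio_feat]
    scaled_video = [video_weight * v for v in video_feat]
    return scaled_audio + scaled_video


def linear_layer(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify(audio_feat, video_feat, weights, bias, audio_weight=0.5):
    """Fuse both modalities and return (predicted class, probabilities)."""
    fused = fuse_features(audio_feat, video_feat, audio_weight)
    probs = softmax(linear_layer(fused, weights, bias))
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Hypothetical sizes; real CNN feature vectors would be far larger.
AUDIO_DIM, VIDEO_DIM, NUM_CLASSES = 4, 6, 3
rng = random.Random(0)

# Random placeholders for features a CNN extractor would produce.
audio_feat = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
video_feat = [rng.uniform(-1, 1) for _ in range(VIDEO_DIM)]

# Untrained classifier parameters, for illustration only.
weights = [[rng.gauss(0, 0.1) for _ in range(AUDIO_DIM + VIDEO_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

predicted, probs = classify(audio_feat, video_feat, weights, bias,
                            audio_weight=0.4)
print("predicted class:", predicted)
print("class probabilities:", [round(p, 3) for p in probs])
```

In this sketch `audio_weight` is a fixed scalar; a trainable model would more plausibly learn the modality weights jointly with the classifier.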

Keywords:
Computer science; Artificial intelligence; Feature (linguistics); Audio visual; Convolutional neural network; Fuse (electrical); Image (mathematics); Modality (human–computer interaction); Code (set theory); Pattern recognition (psychology); Visualization; Action (physics); Audio mining; Speech recognition; Computer vision; Multimedia; Acoustic model; Speech processing; Set (abstract data type)

Metrics

Cited By: 26
FWCI (Field Weighted Citation Impact): 13.78
Refs: 61
Citation Normalized Percentile: 0.98

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Hand Gesture Recognition Systems
Physical Sciences →  Computer Science →  Human-Computer Interaction
Anomaly Detection Techniques and Applications
Physical Sciences →  Computer Science →  Artificial Intelligence