JOURNAL ARTICLE

Exploring multimodal video representation for action recognition

Abstract

A video contains rich perceptual information, such as visual appearance, motion and audio, which can be used to understand the activities it depicts. Recent work has shown that combining appearance (spatial) and motion (temporal) cues can significantly improve human action recognition in videos. To further explore multimodal video representation for action recognition, we propose a framework that learns a multimodal representation from video appearance, motion and audio data. A separate Convolutional Neural Network (CNN) is trained for each modality. To fuse the features extracted by the CNNs, we add a fusion layer on top of them to learn a joint video representation. In the fusion phase, we investigate both early fusion and late fusion, using a neural network and a Support Vector Machine. Compared with existing work, ours (1) measures the benefit of taking audio information into consideration and (2) implements more sophisticated fusion methods. The effectiveness of the proposed approach is evaluated on UCF101 and UCF101-50 (a selected subset in which every video contains audio data) for action recognition. The experimental results show that the different modalities are complementary to each other and that a multimodal representation can benefit the final prediction. Furthermore, the proposed fusion approach achieves 85.1% accuracy when fusing spatial and temporal features on UCF101 (split 1), which is competitive with state-of-the-art methods.
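
The difference between early and late fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes a learned fusion layer (a neural network or an SVM), whereas the sketch below uses plain concatenation and weighted averaging as stand-ins, only to show what each scheme consumes. The function names, modality names and toy numbers are all hypothetical.

```python
import math
from typing import Dict, List, Optional


def softmax(scores: List[float]) -> List[float]:
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(features: Dict[str, List[float]]) -> List[float]:
    """Early fusion: join per-modality FEATURE vectors into one vector.

    A classifier (assumed, not shown) would then be trained on the result.
    Modalities are visited in sorted name order so the layout is stable.
    """
    fused: List[float] = []
    for name in sorted(features):
        fused.extend(features[name])
    return fused


def late_fusion(
    class_scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Late fusion: combine per-modality PREDICTIONS.

    Each modality's scores are converted to probabilities, then averaged
    with optional weights (uniform by default). Averaging is an assumption
    standing in for the learned fusion described in the abstract.
    """
    names = sorted(class_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    fused = [0.0] * len(class_scores[names[0]])
    for name in names:
        for i, p in enumerate(softmax(class_scores[name])):
            fused[i] += weights[name] * p / total_weight
    return fused


# Toy inputs (hypothetical): tiny feature vectors and 3-class scores.
toy_features = {"appearance": [0.2, 0.9], "motion": [0.5], "audio": [0.1, 0.3]}
toy_scores = {
    "appearance": [2.0, 1.0, 0.0],
    "motion": [0.0, 3.0, 0.0],
    "audio": [1.0, 1.0, 1.0],
}

joint_vector = early_fusion(toy_features)
fused_probs = late_fusion(toy_scores)
predicted_class = fused_probs.index(max(fused_probs))
```

In this sketch, early fusion hands a single classifier one joint vector, while late fusion lets each modality vote with its own prediction. Which of the two works better on UCF101 is an empirical question that the abstract leaves to the experiments.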

Keywords:
Computer science; Convolutional neural network; Artificial intelligence; Representation (politics); Modality (human–computer interaction); Action recognition; Motion (physics); Pattern recognition (psychology); Optical flow; Activity recognition; Computer vision; Speech recognition; Class (philosophy); Image (mathematics)

Metrics

Cited By: 25
FWCI (Field Weighted Citation Impact): 1.17
Refs: 44
Citation Normalized Percentile: 0.87

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Anomaly Detection Techniques and Applications
Physical Sciences →  Computer Science →  Artificial Intelligence
Hand Gesture Recognition Systems
Physical Sciences →  Computer Science →  Human-Computer Interaction

Related Documents

JOURNAL ARTICLE

Multimodal human action recognition based on spatio-temporal action representation recognition model

Qianhan Wu, Qian Huang, Xing Li

Journal: Multimedia Tools and Applications, Year: 2022, Vol: 82 (11), Pages: 16409-16430

JOURNAL ARTICLE

Learning hierarchical video representation for action recognition

Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, Jiebo Luo

Journal: International Journal of Multimedia Information Retrieval, Year: 2017, Vol: 6 (1), Pages: 85-98

JOURNAL ARTICLE

Multi-Modality Video Representation for Action Recognition

Chao Zhu, Yike Wang, Dongbing Pu, Miao Qi, Hui Sun, Lei Tan

Journal: Journal on Big Data, Year: 2020, Vol: 2 (3), Pages: 95-104