A video contains rich perceptual information, such as visual appearance, motion, and audio, which can be used to understand the activities it depicts. Recent works have shown that combining appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. To further explore multimodal video representations for action recognition, we propose a framework that learns a multimodal representation from video appearance, motion, and audio data. A Convolutional Neural Network (CNN) is trained for each modality. To fuse the features extracted by these CNNs, we add a fusion layer on top of the CNNs to learn a joint video representation. In the fusion phase, we investigate both early fusion and late fusion with a Neural Network and a Support Vector Machine. Compared to existing works, (1) our work measures the benefit of taking audio information into consideration and (2) implements more sophisticated fusion methods. The effectiveness of the proposed approach is evaluated on UCF101 and UCF101-50 (a selected subset in which every video contains audio) for action recognition. The experimental results show that the different modalities are complementary to each other and that a multimodal representation benefits the final prediction. Furthermore, the proposed fusion approach achieves 85.1% accuracy when fusing spatial and temporal features on UCF101 (split 1), which is competitive with state-of-the-art works.
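To make the fusion idea concrete, the sketch below shows one way a fusion layer over per-modality CNN features could look. It is a minimal illustration, not the authors' implementation: the feature dimensionalities (4096-d per modality), the hidden size, the dropout rate, and the class count (101 for UCF101) are all assumptions, and the module simply concatenates modality features before learning a joint representation, corresponding to the early-fusion variant described in the abstract.

```python
# Minimal sketch of a fusion layer over per-modality CNN features.
# All sizes (4096-d features, 1024-d joint layer, 101 classes) are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Fuses spatial, temporal, and audio CNN features with a fully
    connected fusion layer followed by a linear classifier."""

    def __init__(self, feat_dims=(4096, 4096, 4096), hidden_dim=1024, num_classes=101):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(sum(feat_dims), hidden_dim),  # joint video representation
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, spatial_feat, temporal_feat, audio_feat):
        # Concatenate modality features, then learn a joint representation.
        joint = self.fusion(torch.cat([spatial_feat, temporal_feat, audio_feat], dim=1))
        return self.classifier(joint)


if __name__ == "__main__":
    head = FusionHead()
    s = torch.randn(8, 4096)   # appearance (spatial) CNN features
    t = torch.randn(8, 4096)   # motion (temporal) CNN features
    a = torch.randn(8, 4096)   # audio CNN features
    logits = head(s, t, a)
    print(logits.shape)        # torch.Size([8, 101])
```

A late-fusion variant would instead train a separate classifier per modality and combine their scores (e.g., by averaging or with an SVM over the stacked predictions); the choice between the two is exactly what the fusion experiments compare.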