A video contains rich perceptual information, such as visual appearance, motion, and audio, which can be used to understand the activities it depicts. Recent works have shown that combining appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. To further explore multimodal video representations for action recognition, we propose a framework that learns a multimodal representation from video appearance, motion, and audio data. A Convolutional Neural Network (CNN) is trained for each modality. To fuse the features extracted by these CNNs, we add a fusion layer on top of the CNNs to learn a joint video representation. In the fusion phase, we investigate both early fusion and late fusion with a Neural Network and a Support Vector Machine. Compared to existing works, (1) our work measures the benefit of taking audio information into consideration and (2) implements more sophisticated fusion methods. The effectiveness of the proposed approach is evaluated on UCF101 and UCF101-50 (a selected subset in which every video contains audio) for action recognition. The experimental results show that the different modalities are complementary to each other and that a multimodal representation benefits the final prediction. Furthermore, the proposed fusion approach achieves 85.1% accuracy when fusing spatial and temporal features on UCF101 (split 1), which is competitive with state-of-the-art works.
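To make the fusion idea concrete, the sketch below shows one way a fusion layer over per-modality CNN features could look. It is a minimal illustration, not the authors' implementation: the feature dimensionalities (4096-d per modality), the hidden size, the dropout rate, and the class count (101 for UCF101) are all assumptions, and the module simply concatenates modality features before learning a joint representation, corresponding to the early-fusion variant described in the abstract.

```python
# Minimal sketch of a fusion layer over per-modality CNN features.
# All sizes (4096-d features, 1024-d joint layer, 101 classes) are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Fuses spatial, temporal, and audio CNN features with a fully
    connected fusion layer followed by a linear classifier."""

    def __init__(self, feat_dims=(4096, 4096, 4096), hidden_dim=1024, num_classes=101):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(sum(feat_dims), hidden_dim),  # joint video representation
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, spatial_feat, temporal_feat, audio_feat):
        # Concatenate modality features, then learn a joint representation.
        joint = self.fusion(torch.cat([spatial_feat, temporal_feat, audio_feat], dim=1))
        return self.classifier(joint)


if __name__ == "__main__":
    head = FusionHead()
    s = torch.randn(8, 4096)   # appearance (spatial) CNN features
    t = torch.randn(8, 4096)   # motion (temporal) CNN features
    a = torch.randn(8, 4096)   # audio CNN features
    logits = head(s, t, a)
    print(logits.shape)        # torch.Size([8, 101])
```

A late-fusion variant would instead train a separate classifier per modality and combine their scores (e.g., by averaging or with an SVM over the stacked predictions); the choice between the two is exactly what the fusion experiments compare.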