In this paper, we propose a multimodal multi-stream deep learning framework to tackle the egocentric activity recognition problem using both video and sensor data. First, we experiment with and extend a multi-stream Convolutional Neural Network to learn spatial and temporal features from egocentric videos. Second, we propose a multi-stream Long Short-Term Memory architecture to learn features from multiple sensor streams (accelerometer, gyroscope, etc.). Third, we propose a two-level fusion technique and experiment with different pooling techniques to compute the prediction results. Experimental results on a multimodal egocentric dataset show that our proposed method achieves very encouraging performance, despite the constraint that the scale of existing egocentric datasets is still quite limited.
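To make the sensor branch and the two-level fusion scheme concrete, the following PyTorch sketch shows one plausible reading of the architecture: one LSTM branch per sensor stream, a first fusion level that combines the per-stream features into sensor-level class scores, and a second level that pools the sensor and video scores into the final prediction. All module names, dimensions, and the specific choices of concatenation and average pooling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SensorStreamLSTM(nn.Module):
    """One LSTM branch per sensor stream (e.g., accelerometer or gyroscope)."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):           # x: (batch, time, in_dim)
        _, (h, _) = self.lstm(x)    # h: (1, batch, hidden_dim)
        return h.squeeze(0)         # last hidden state: (batch, hidden_dim)

class TwoLevelFusionModel(nn.Module):
    """Level 1: fuse per-stream LSTM features into sensor-level class scores.
    Level 2: pool sensor scores with video-stream scores for the final output."""
    def __init__(self, sensor_dims, num_classes, hidden_dim=64):
        super().__init__()
        self.branches = nn.ModuleList(
            SensorStreamLSTM(d, hidden_dim) for d in sensor_dims)
        self.sensor_head = nn.Linear(hidden_dim * len(sensor_dims), num_classes)

    def forward(self, sensor_seqs, video_scores):
        # Level-1 fusion: concatenate the per-stream LSTM features.
        feats = torch.cat(
            [b(x) for b, x in zip(self.branches, sensor_seqs)], dim=1)
        sensor_scores = self.sensor_head(feats)
        # Level-2 fusion: average-pool the modality scores (max pooling is
        # an alternative pooling choice one could experiment with).
        return torch.stack([sensor_scores, video_scores]).mean(dim=0)

# Example: 3-axis accelerometer and gyroscope streams, 10 activity classes.
model = TwoLevelFusionModel(sensor_dims=[3, 3], num_classes=10)
acc = torch.randn(8, 100, 3)        # (batch, time, channels)
gyr = torch.randn(8, 100, 3)
video_scores = torch.randn(8, 10)   # scores from the video CNN streams
out = model([acc, gyr], video_scores)  # (8, 10)
```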