Yingying Chen, Yanfang Wang, Chang Li, Q. Li, Qian Huang
RGB video-based action recognition has many application scenarios because its rich appearance information supports accurate and robust performance. In recent years, convolutional neural networks have developed rapidly and achieved strong results in action recognition. However, they cannot adequately extract fine-grained information, and even when two modalities are used it is difficult for the learned spatio-temporal information to complement each other effectively. In this paper, we propose a dual-stream multi-scale fusion method. The method constructs different fine-grained representations of key features through a key feature extraction module and near-by fusion, further extracting and enhancing multi-scale information. In the multi-scale cross fusion, temporal gradients of motion information interact with the RGB video to strengthen complementarity between the modalities. The final result fuses multi-scale representations within modalities and higher-order similarities between modalities, yielding fine-grained learning of both appearance and motion. Compared with other commonly used methods, the proposed algorithm shows significant improvement on the UCF101 and HMDB51 datasets, achieving 94.12% and 72.55% accuracy, respectively.
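The abstract mentions two building blocks that can be illustrated concretely: temporal gradients (frame-to-frame differences) as a lightweight motion modality, and multi-scale pooling of feature maps before fusing the two streams. The sketch below is not the authors' implementation; the function names, the choice of simple average pooling, and the concatenation-based fusion are illustrative assumptions only.

```python
import numpy as np

def temporal_gradients(clip):
    # Approximate motion as frame-to-frame differences of an RGB clip.
    # clip: array of shape (T, H, W, C); returns (T-1, H, W, C).
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]

def multiscale_pool(feat, scales=(1, 2, 4)):
    # Average-pool a (H, W) feature map into an s x s grid for each scale,
    # then concatenate the flattened grids into one multi-scale vector.
    parts = []
    for s in scales:
        h = feat.shape[0] // s * s
        w = feat.shape[1] // s * s
        blocks = feat[:h, :w].reshape(s, h // s, s, w // s)
        parts.append(blocks.mean(axis=(1, 3)).ravel())
    return np.concatenate(parts)

# Toy example: a 4-frame clip of 8x8 RGB frames.
clip = np.random.rand(4, 8, 8, 3)
tg = temporal_gradients(clip)                  # shape (3, 8, 8, 3)

# Fuse one RGB channel and one temporal-gradient channel by simple
# concatenation of their multi-scale descriptors (a stand-in for the
# paper's cross-fusion step).
rgb_vec = multiscale_pool(clip[0, :, :, 0])    # 1 + 4 + 16 = 21 values
tg_vec = multiscale_pool(tg[0, :, :, 0])
fused = np.concatenate([rgb_vec, tg_vec])      # shape (42,)
```

Here each scale `s` summarizes the map at a different granularity, so the fused vector carries both coarse appearance statistics and finer local detail from the motion stream.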