Jin Miao, Bing Lu, Yanli Zhang, Tianfu Huang, Yinglong Diao, Jiang Zhenyu, Du Bolun, Gaoning Nie
Effectively capturing both the temporal and spatial features of human actions is fundamental to designing robust action recognition classifiers. In this study, we introduce an end-to-end dual-stream approach for human action recognition that leverages global and local feature representations in conjunction with conditional random fields. The proposed framework adopts a dual-stream network design, in which spatial and temporal cues are first extracted from video frames using the ViBe algorithm (enhanced with a flicker coefficient) and the unsupervised TV-Net, respectively. These features are fed separately into the corresponding spatial and temporal branches of the network for pre-training and subsequent feature extraction. A parallel fusion mechanism then integrates the outputs of both streams, enriching the descriptive power of the learned features. In the final stage, an improved anisotropic Markov random field model is employed for network training and result refinement. Comprehensive experiments on the widely used UCF101 and HMDB51 datasets, as well as on a proprietary Fujian electric power measurement action dataset, demonstrate that the proposed method achieves superior robustness and higher recognition accuracy than state-of-the-art techniques.
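The dual-stream design with parallel fusion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the per-stream encoders here are toy linear projections standing in for the pretrained spatial (ViBe-based) and temporal (TV-Net-based) branches, the feature dimensions are arbitrary, and fusion is modeled as side-by-side concatenation of the two stream descriptors, one common reading of "parallel fusion".

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_features(frames, weights):
    """Toy per-stream encoder: mean-pool the frame features, then apply
    a linear projection with a tanh nonlinearity. Stands in for the
    pretrained spatial or temporal branch of the dual-stream network."""
    pooled = frames.mean(axis=0)          # (D,)
    return np.tanh(pooled @ weights)      # (F,)

def parallel_fusion(spatial_feat, temporal_feat):
    """Parallel fusion modeled as concatenation: both stream descriptors
    are kept side by side, so neither stream's information is averaged
    away before classification."""
    return np.concatenate([spatial_feat, temporal_feat])

# Hypothetical sizes: 16 frames, 128-d frame features, 64-d per-stream output.
frames = rng.standard_normal((16, 128))
w_spatial = rng.standard_normal((128, 64))
w_temporal = rng.standard_normal((128, 64))

fused = parallel_fusion(stream_features(frames, w_spatial),
                        stream_features(frames, w_temporal))
print(fused.shape)  # (128,) -- twice the per-stream dimension
```

A classifier head (and, per the abstract, the improved anisotropic Markov random field refinement) would then operate on the fused 128-d descriptor rather than on either stream alone.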