Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Zhigang Tu; Hongyan Li; Dejun Zhang; Justin Dauwels; Baoxin Li; Junsong Yuan

doi:10.1109/tip.2018.2890749

Year: 2019
Journal: IEEE Transactions on Image Processing
Vol: 28 (6)
Pages: 2799-2812
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal Vector of Locally Aggregated Descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, AVFS-ASFS selects the key frame features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted key frame feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated by using our exploited similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks: HMDB51, UCF101, Kinetics, and ActivityNet. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
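As a rough illustration of the aggregation idea described in the abstract, the sketch below computes a plain VLAD encoding over one segment and weights each descriptor's residual by its similarity to the segment's key frame feature. It is not the paper's method: the function names, the clipped-cosine weight, and the hard nearest-center assignment are assumptions made for illustration, the abstract does not define the similarity weight, and the adaptive segmentation and flow-guided warping steps are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def nearest_center(x, centers):
    """Index of the codebook center closest to x (squared Euclidean distance)."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    return dists.index(min(dists))


def weighted_vlad(features, centers, key_feature):
    """Similarity-weighted VLAD over one segment (illustrative only).

    features:    list of D-dim descriptors from the segment
    centers:     list of K D-dim codebook centers (assumed given, e.g. from k-means)
    key_feature: D-dim descriptor of the segment's key frame
    Returns an L2-normalized vector of length K * D.
    """
    k, d = len(centers), len(centers[0])
    vlad = [[0.0] * d for _ in range(k)]
    for x in features:
        # Hypothetical weight: cosine similarity to the key frame, clipped at 0.
        w = max(cosine(x, key_feature), 0.0)
        j = nearest_center(x, centers)  # hard assignment to one center
        for i in range(d):
            vlad[j][i] += w * (x[i] - centers[j][i])  # weighted residual
    flat = [v for row in vlad for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy usage: two centers, three 2-D descriptors, one key frame feature.
centers = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
encoding = weighted_vlad(features, centers, key_feature=[1.0, 1.0])
print(len(encoding))  # K * D entries
```

Descriptors that resemble the key frame contribute more to the pooled residuals, which is one simple way to read "aggregated by using a similarity weight"; the actual weighting in the paper may differ.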

Keywords:

Computer science; Artificial intelligence; Pattern recognition (psychology); Feature (linguistics); RGB color model; Convolutional neural network; Feature extraction; Optical flow; Segmentation; Action recognition; Computer vision; Image (mathematics)

Metrics

Cited By: 166

FWCI (Field Weighted Citation Impact): 9.94

Refs

Citation Normalized Percentile: 0.98

Is in top 1%

Is in top 10%

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Surveillance and Tracking Methods

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Action Recognition with Uncertain VLAD

Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

Video Action Recognition Based on Spatiotemporal Sampling

Spatiotemporal Multiplier Networks for Video Action Recognition

Spatiotemporal Pyramid Network for Video Action Recognition