JOURNAL ARTICLE

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, Yong Liu

Year: 2022 Journal:   IEEE Transactions on Circuits and Systems for Video Technology Vol: 32 (11) Pages: 7809-7820   Publisher: Institute of Electrical and Electronics Engineers

Abstract

Video object detection (VOD) aims to detect objects in every frame of a video, a challenging task because object appearance deteriorates in some frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they operate only at the frame level or the proposal level and therefore cannot integrate spatial-temporal features sufficiently. To address this, we treat VOD as an interaction process among hierarchical spatial-temporal features and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework that exploits frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than updating only the reference features. The proposal-level feature aggregation then models object relations to augment reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint that boosts instance-level features by leveraging support object proposal features belonging to the same object. In addition, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments on the ImageNet VID and UAVDT datasets demonstrate the superiority of our method over state-of-the-art (SOTA) methods. With ResNet-101, our method achieves 83.3% and 62.1% on the two datasets, outperforming the SOTA method MEGA by 0.4% and 2.7%, respectively.
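The instance ID constraint described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that instance IDs for reference and support proposals are already available as inputs, that similarity is cosine similarity, and that the aggregated support feature is added to the reference feature as a residual. The function names `cosine` and `aggregate_same_instance` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def aggregate_same_instance(ref_feats, ref_ids, sup_feats, sup_ids):
    """Hypothetical instance-level aggregation with an ID constraint.

    Each reference proposal attends only to support proposals that share
    its instance ID; weights are a softmax over cosine similarities.
    """
    out = []
    for feat, inst_id in zip(ref_feats, ref_ids):
        # ID constraint: discard support proposals from other instances.
        matched = [s for s, sid in zip(sup_feats, sup_ids) if sid == inst_id]
        if not matched:
            out.append(list(feat))  # no same-instance support: keep as-is
            continue
        sims = [cosine(feat, s) for s in matched]
        peak = max(sims)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        agg = [
            sum(w * s[d] for w, s in zip(weights, matched))
            for d in range(len(feat))
        ]
        # Residual update (an assumption, not taken from the paper).
        out.append([f + a for f, a in zip(feat, agg)])
    return out


# Toy data: two reference proposals, three support proposals.
ref = [[1.0, 0.0], [0.0, 1.0]]
ref_ids = [7, 9]
sup = [[2.0, 0.0], [0.0, 5.0], [4.0, 0.0]]
sup_ids = [7, 3, 7]
result = aggregate_same_instance(ref, ref_ids, sup, sup_ids)
print(result)
```

In the toy data, the first reference proposal (ID 7) is updated from the two support proposals with the same ID, while the support proposal with ID 3 is ignored. The second reference proposal (ID 9) has no matching support and is returned unchanged.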

Keywords:
Computer science; Artificial intelligence; Feature (linguistics); Frame (networking); Object (grammar); Context (archaeology); Pixel; Pattern recognition (psychology); Object detection; Exploit; Spatial contextual awareness; Constraint (computer-aided design); Computer vision; Similarity (geometry); Spatial analysis; Image (mathematics); Mathematics

Metrics

Cited By: 30
FWCI (Field Weighted Citation Impact): 3.71
Refs: 86
Citation Normalized Percentile: 0.93
Is in top 1%
Is in top 10%

Topics

Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Temporal Context Enhanced Feature Aggregation for Video Object Detection

Fei He, Naiyu Gao, Qiaozhe Li, Senyao Du, Xin Zhao, Kaiqi Huang

Journal:   Proceedings of the AAAI Conference on Artificial Intelligence Year: 2020 Vol: 34 (07) Pages: 10941-10948
DISSERTATION

Real-Time Video Object Detection with Temporal Feature Aggregation

Meihong Chen

University:   uO Research (University of Ottawa) Year: 2021
JOURNAL ARTICLE

Temporal-adaptive sparse feature aggregation for video object detection

Fei He, Qiaozhe Li, Xin Zhao, Kaiqi Huang

Journal:   Pattern Recognition Year: 2022 Vol: 127 Pages: 108587-108587