JOURNAL ARTICLE

FIANet: Video Object Detection via Joint Feature-Level and Instance-Level Aggregation

Abstract

Video object detection is challenging because objects in videos undergo both nonrigid and rigid appearance deformations. Most competitive methods enhance per-frame features by aggregating many previous and future frames. However, feature-level aggregation is not robust to rigid deformations such as occlusion and rare postures. In this paper, we propose an online video object detection method, the joint feature-level and instance-level aggregation network (FIANet). In addition to feature-level aggregation, we design a spatial-temporal instance calibration (STIC) module that aggregates each instance as a whole, which reduces the interference of locally distorted and missing pixels. Feature-level and instance-level aggregation work collaboratively to overcome different deformations. Using only a small number of previous frames, our method achieves 81.6% mAP at relatively high speed on ImageNet VID, which is state-of-the-art among both causal and non-causal methods.
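To make the distinction between the two aggregation levels concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how aggregation weights are computed, so the softmax-over-cosine-similarity weighting here is an assumption, and the names `aggregate_weights`, `feature_level` and `instance_level` are hypothetical. The sketch is causal: it uses only the current frame and previous frames. Feature-level aggregation weights every spatial position independently, whereas instance-level aggregation computes one weight per past instance and applies it to all of that instance's positions, so a few locally distorted positions have less influence on the weighting.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def aggregate_weights(current, candidates):
    """Softmax over cosine similarity to `current` (assumed weighting scheme)."""
    scores = [cosine(current, c) for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_level(current_map, previous_maps):
    """Per-position aggregation over the current and previous frames.

    Each map is a list of feature vectors, one per spatial position.
    Every position gets its own set of frame weights.
    """
    out = []
    for p, vec in enumerate(current_map):
        candidates = [vec] + [m[p] for m in previous_maps]
        w = aggregate_weights(vec, candidates)
        out.append([sum(wi * c[i] for wi, c in zip(w, candidates))
                    for i in range(len(vec))])
    return out

def instance_level(current_roi, previous_rois):
    """Whole-instance aggregation over the current and previous instances.

    Each RoI is a list of feature vectors. One weight per instance is
    computed from the flattened RoI and shared by all of its positions.
    """
    def flat(roi):
        return [x for vec in roi for x in vec]

    candidates = [current_roi] + list(previous_rois)
    w = aggregate_weights(flat(current_roi), [flat(c) for c in candidates])
    return [[sum(wi * c[p][i] for wi, c in zip(w, candidates))
             for i in range(len(vec))]
            for p, vec in enumerate(current_roi)]
```

With an empty history, both functions return the current features unchanged, since the current frame is always one of the candidates being blended.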

Keywords:
Computer science; Artificial intelligence; Aggregate (composite); Feature (linguistics); Joint (building); Frame (networking); Computer vision; Object (grammar); Object detection; Pixel; Pattern recognition (psychology); Feature extraction; Interference (communication); Engineering

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.10
Refs: 27
Citation Normalized Percentile: 0.38

Topics

Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition