Chen Zhu, Weihai Li, Chi Fei, Bin Liu, Nenghai Yu
Video object detection is a challenging problem in computer vision. In this paper, we propose a novel spatial-temporal feature aggregation network to address it. Specifically, we present an instance-level feature aggregation module that complements traditional pixel-level feature aggregation, and within it we build a new movement estimation module to learn instance movements across frames. Graph Convolutional Networks (GCNs) are then applied to capture the temporal relations among instances across frames and thereby implement instance-level feature aggregation. Finally, we combine pixel-level and instance-level features with learnable soft weights to exploit their complementary information. Our framework is simple to implement and supports end-to-end training, and extensive experiments show that it achieves state-of-the-art performance on the ImageNet VID dataset.
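As a rough illustration of the last two steps in the abstract above, the following NumPy sketch applies one graph-convolution step to per-instance features and then fuses the result with pixel-level features using softmax-normalized weights. It is not the authors' implementation: the function names (`gcn_layer`, `fuse`), the cosine-similarity adjacency, the symmetric normalization, the feature sizes, and the use of two scalar logits as "soft weights" are all assumptions made for this example, and the movement estimation module is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def gcn_layer(features, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    Assumption: the standard symmetric normalization with self-loops;
    the abstract does not say which GCN variant is used.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degrees are >= 1
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)


def fuse(pixel_feat, instance_feat, logits):
    """Blend two feature sets with soft weights that sum to 1.

    Assumption: `logits` stands in for two learnable scalars.
    """
    w = softmax(logits)
    return w[0] * pixel_feat + w[1] * instance_feat


rng = np.random.default_rng(0)
num_instances, dim = 4, 8  # hypothetical sizes

# Hypothetical instance features gathered across several frames.
instance_feat = rng.normal(size=(num_instances, dim))

# Hypothetical relation graph: non-negative cosine similarity between instances.
normed = instance_feat / np.linalg.norm(instance_feat, axis=1, keepdims=True)
adjacency = np.clip(normed @ normed.T, 0.0, None)
np.fill_diagonal(adjacency, 0.0)

weight = rng.normal(size=(dim, dim)) * 0.1  # stands in for a learned matrix
aggregated = gcn_layer(instance_feat, adjacency, weight)

# Hypothetical pixel-level features for the same instances.
pixel_feat = rng.normal(size=(num_instances, dim))
logits = np.zeros(2)  # equal weights before any training
fused = fuse(pixel_feat, aggregated, logits)
```

In a trained model the logits and the weight matrix would be learned jointly with the detector; here they are fixed so that the data flow is easy to follow.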
Chao Xu, Jiangning Zhang, Mengmeng Wang, Guanzhong Tian, Yong Liu
Shangdong Zheng, Zebin Wu, Yang Xu, Pengfei Liu, Peng Zheng, Zhihui Wei
Fei He, Naiyu Gao, Qiaozhe Li, Senyao Du, Xin Zhao, Kaiqi Huang
Fei He, Qiaozhe Li, Xin Zhao, Kaiqi Huang