Yi Li, Sile Ma, Zhenyu Li, Yizhong Luan, Zecui Jiang
This paper proposes an object-level feature memory module that uses attention mechanisms to exploit spatial and temporal context in videos. Compared to still-image object detectors, video object detectors aggregate features along the spatiotemporal dimensions, leading to higher accuracy. However, previous video object detection methods typically perform memory and fusion at the feature-map level when integrating features across frames. These approaches not only impose significant computational and memory burdens but also introduce considerable noise. To address these challenges, we introduce an object-level feature memory that retains features from previous frames while reducing memory and computational overhead, yielding a substantial improvement in video object detection performance. Experiments on the UA-DETRAC dataset validate the effectiveness of our approach in live-stream video object detection scenarios. Our method achieves 66.73% AP with a YOLOX-S backbone, 4.0 AP higher than the plain YOLOX-S. Our code is released at https://github.com/Liyi4578/0FMA.
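To make the idea concrete, the sketch below shows one plausible form of an object-level feature memory: a fixed-capacity FIFO buffer of per-object feature vectors from past frames, fused with the current frame's object features via scaled dot-product attention and a residual connection. The class name, capacity, and residual fusion are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

class ObjectFeatureMemory:
    """Hypothetical object-level feature memory (illustrative sketch only).

    Stores per-object feature vectors from previous frames in a fixed-size
    FIFO buffer, and enhances current-frame object features with scaled
    dot-product attention over that memory.
    """

    def __init__(self, dim, capacity=100):
        self.dim = dim
        self.capacity = capacity
        self.memory = np.empty((0, dim))  # (M, dim) stored object features

    def update(self, feats):
        """Append current-frame object features (N, dim), evicting the oldest."""
        self.memory = np.concatenate([self.memory, feats])[-self.capacity:]

    def enhance(self, feats):
        """Fuse current object features (N, dim) with memory via attention."""
        if self.memory.shape[0] == 0:
            return feats  # nothing stored yet: pass features through unchanged
        scores = feats @ self.memory.T / np.sqrt(self.dim)   # (N, M) similarities
        scores -= scores.max(axis=1, keepdims=True)          # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)        # softmax over memory slots
        aggregated = weights @ self.memory                   # (N, dim) memory readout
        return feats + aggregated                            # residual fusion
```

Per frame, a detector would call `enhance` on the current frame's object features and then `update` the memory with them; because the buffer holds only per-object vectors rather than whole feature maps, its cost stays small and bounded.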