Video object detection is a challenging task due to the deteriorated quality of video sequences captured in complex environments. Currently, this area is dominated by feature enhancement based methods, which distill beneficial semantic information from multiple frames and generate enhanced features by fusing the distilled information. However, the distillation and fusion operations are usually performed at either the frame level or the instance level, with external guidance from additional information such as optical flow or feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance among current state-of-the-art video object detectors: 84.1\% mAP with ResNet-101 and 85.4\% mAP with ResNeXt-101, without using any post-processing steps.
Tianxiang Hou, Qiang Qi, Yang Lu, Kaiwen Du, Hanzi Wang
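The abstract does not specify the fusion operator, but the general idea of similarity-weighted feature fusion across frames can be illustrated with a minimal sketch. The function below is a hypothetical example, not the authors' actual DSFNet method: it aggregates support-frame features into a reference-frame feature using softmax-normalized cosine similarities as fusion weights.

```python
import numpy as np

def fuse_features(ref, supports, eps=1e-8):
    """Similarity-weighted feature fusion (illustrative sketch only).

    ref:      (d,)  feature vector of the reference frame
    supports: (n,d) feature vectors from n support frames
    Returns a fused (d,) feature: a weighted average of the support
    features, where each weight reflects cosine similarity to ref.
    """
    # L2-normalize to compute cosine similarities
    ref_n = ref / (np.linalg.norm(ref) + eps)
    sup_n = supports / (np.linalg.norm(supports, axis=1, keepdims=True) + eps)
    sims = sup_n @ ref_n                          # (n,) cosine similarities
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over supports
    return weights @ supports                     # similarity-weighted average
```

If every support frame carries the same feature as the reference, the fused feature equals that feature, since the softmax weights sum to one; dissimilar (e.g., noisy) supports receive smaller weights, which loosely mirrors the motivation for a similarity measure in the fusion process.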