Jing Wang, Hejiang Li, Caihong Huangfu
In recent years, Transformer-based object detectors have performed strongly in general scenarios thanks to their global feature modeling capability. When applied directly to aerial image detection, however, their performance often falls short of expectations. The root cause is that aerial imagery typically contains many small objects. These objects occupy very few pixels, which yields weak feature representations, and they are easily affected by complex background noise and by mutual interference among densely distributed targets. As a result, Transformer models struggle to capture and distinguish small-object features. To address these challenges, this paper proposes an enhanced Transformer architecture for aerial small object detection: Dynamic Interactive Fusion DETR (DIF-DETR). Its core innovations are twofold. First, it introduces DIENet, a backbone feature extraction network with embedded DIEBlocks. These DIEBlocks serve as feature enhancement units within the backbone, using dynamic Inception-style multi-branch depthwise convolutions and an adaptive weight allocation mechanism to efficiently capture multi-scale, long-range contextual information. Second, it introduces Context-Aware Bidirectional Fusion (CABF), which adaptively fuses complementary high-level semantic features and low-level detail features within the FPN-PAN structure of the neck network, mitigating the problem of small-target features being obscured by background interference. Experiments on the challenging VisDrone and HIT-UAV aerial datasets show that DIF-DETR outperforms existing mainstream models, reaching 30.5% mAP and 82.3% mAPtest, respectively. At the same time, it reduces computational cost to 43.6 GFLOPs with only 13.4M parameters, achieving a favorable balance between detection accuracy and computational efficiency. These results indicate that, through the combined effect of its core innovations, DIF-DETR improves detection accuracy and robustness for small objects in aerial images and offers an effective solution for object detection in aerial scenarios.
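The idea of multi-branch convolutions combined with adaptive weight allocation can be illustrated with a toy one-dimensional sketch. This is not the authors' DIEBlock: the abstract does not specify its internals, so the kernel sizes, the gating rule, and every name below (`depthwise_conv1d`, `dynamic_multibranch`, `gate_weights`, `gate_biases`) are assumptions chosen purely for illustration. The sketch runs one channel through several branches with different receptive fields, derives per-branch weights from a global average of the input, and fuses the branch outputs with those weights.

```python
import math


def depthwise_conv1d(signal, kernel):
    """Zero-padded, same-length 1-D convolution of a single channel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_multibranch(signal, kernels, gate_weights, gate_biases):
    """Fuse branch outputs using input-dependent softmax weights.

    gate_weights and gate_biases are hypothetical gating parameters
    (one scalar pair per branch); in a real network they would be learned.
    """
    # One branch per kernel size, each with a different receptive field.
    branches = [depthwise_conv1d(signal, k) for k in kernels]
    # Global average pooling gives a single context descriptor.
    context = sum(signal) / len(signal)
    # Adaptive weight allocation: branch scores depend on the input context.
    scores = [w * context + b for w, b in zip(gate_weights, gate_biases)]
    alphas = softmax(scores)
    fused = [
        sum(a * branch[i] for a, branch in zip(alphas, branches))
        for i in range(len(signal))
    ]
    return fused, alphas


# A single "small object" (one active sample) in an otherwise empty signal.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# Averaging kernels of width 1, 3 and 5 stand in for multi-scale branches.
kernels = [[1.0], [1 / 3] * 3, [1 / 5] * 5]
fused, alphas = dynamic_multibranch(
    signal, kernels, gate_weights=[1.0, 0.0, -1.0], gate_biases=[0.0, 0.0, 0.0]
)
```

With these illustrative gate parameters, a positive context value shifts weight toward the narrowest branch, so the fused output keeps the isolated peak sharper than a uniform average of the branches would. In a two-dimensional network the same pattern would typically use depthwise convolutions over feature maps and learned gating layers.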
Fufang Li, Yuehua Zhang, Yuxuan Fan
Zhiyang Chen, Ying Yu, Chunping Wang, Renke Kou, Jiaxuan Ma, Gaoyuan Liu
Baoye Song, Shihao Zhao, Zidong Wang, Weibo Liu, Xiaohui Liu
Xinyu Cao, Hanwei Wang, Xiong Wang, Bin Hu