Jiaze Gu, Lulin Fan, Jing Zhao, Xianghai Cao
Conventional remote sensing object detection models are largely constrained by their dependence on closed-set annotations, limiting their ability to detect objects absent from the training dataset. To enable generalization to novel object categories, we propose CoseDet, a novel open-vocabulary object detection framework that integrates the semantic richness of vision-language pretraining models with the precise localization capability of modern detection architectures. Specifically, CoseDet augments a Faster R-CNN detector with a ResNet50-FPN backbone by incorporating RemoteCLIP-based embeddings through a pseudo-word mechanism, which aligns high-dimensional visual features with robust textual semantics. Furthermore, a convolutional block attention module (CBAM) is employed to refine feature representations, and explicit modeling of surrounding regions is used to capture crucial contextual dependencies. Comprehensive experiments on four datasets demonstrate that CoseDet not only outperforms state-of-the-art methods but also provides a robust and generalizable solution for open-vocabulary object detection in complex remote sensing scenarios.
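The pseudo-word mechanism described above can be illustrated with a minimal NumPy sketch: a learned linear projection maps each region's visual feature into the text-embedding space, where cosine similarity against class-name embeddings yields open-vocabulary scores. All dimensions, names, and the random stand-ins for detector RoI features and RemoteCLIP text embeddings are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 1024-d detector RoI features projected into a
# 512-d text-embedding space over 5 class prompts (all assumptions).
VIS_DIM, TXT_DIM, NUM_CLASSES = 1024, 512, 5

# Learned linear projection that turns a region's visual feature into a
# "pseudo-word" vector living in the text-embedding space.
W = rng.normal(scale=0.02, size=(VIS_DIM, TXT_DIM))

def pseudo_word(roi_feat: np.ndarray) -> np.ndarray:
    """Project an RoI feature into the text space and L2-normalize it."""
    z = roi_feat @ W
    return z / np.linalg.norm(z)

def classify(roi_feat: np.ndarray, class_embeds: np.ndarray) -> np.ndarray:
    """Cosine similarity between the pseudo-word and each class-name
    embedding, followed by a softmax to produce open-vocabulary scores."""
    sims = class_embeds @ pseudo_word(roi_feat)
    e = np.exp(sims - sims.max())
    return e / e.sum()

# Stand-ins for (pre-normalized) text embeddings of the class prompts.
class_embeds = rng.normal(size=(NUM_CLASSES, TXT_DIM))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

scores = classify(rng.normal(size=VIS_DIM), class_embeds)
print(scores.shape)
```

Because classification happens entirely in the text-embedding space, swapping in the embedding of a previously unseen class name is enough to score novel categories, which is the core of the open-vocabulary setting.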