Object detection, a cornerstone of computer vision, has made significant strides, yet faces persistent challenges in generalizing to novel environments, object categories, and distribution shifts. Traditional supervised approaches, heavily reliant on large, meticulously annotated datasets, often struggle with robustness and adaptability when confronted with real-world complexities beyond their training distributions. This paper proposes a novel self-supervised pre-training paradigm for developing foundation models specifically tailored for generalizable object detection. Drawing inspiration from the success of large-scale pre-trained models in natural language processing and recent advancements in self-supervised learning for vision, we detail an architecture and pre-training strategy designed to learn robust, transferable object-centric representations from vast amounts of unlabeled or weakly labeled image data. Our methodology emphasizes masked autoencoding and contrastive learning techniques adapted to capture both holistic scene understanding and fine-grained object semantics. We outline the anticipated benefits of this paradigm, including superior performance in zero-shot, few-shot, and domain adaptation scenarios, reduced annotation dependency, and enhanced model robustness. This work aims to establish a theoretical and methodological framework for building next-generation object detectors capable of truly generalizable perception.
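To make the two named pre-training objectives concrete, here is a minimal NumPy sketch of (a) random patch masking with a reconstruction loss, as in masked autoencoding, and (b) an InfoNCE-style contrastive loss between two embedding views. All function names, the mask ratio, and the temperature are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.75):
    """Randomly split patch indices into visible and masked sets
    (masked-autoencoding style; 0.75 is an assumed ratio)."""
    n = patches.shape[0]
    n_masked = int(n * mask_ratio)
    idx = rng.permutation(n)
    return idx[n_masked:], idx[:n_masked]  # visible, masked

def reconstruction_loss(pred, target):
    """Mean-squared error between predicted and true masked patches."""
    return float(np.mean((pred - target) ** 2))

def info_nce(z1, z2, temperature=0.1):
    """Contrastive InfoNCE loss between two batches of embeddings;
    matching rows are treated as positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives on the diagonal
```

In a full pipeline these two losses would be combined (e.g. as a weighted sum) so the encoder learns both holistic scene reconstruction and instance-discriminative features; the weighting is a design choice the abstract leaves open.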