Abstract

Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, this is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct a benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture, TransRMOT, to tackle the new task in an online manner; it achieves impressive detection performance and outperforms other counterparts. The Refer-KITTI dataset and the code are released at https://referringmot.github.io.
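
To make the task formulation concrete, below is a minimal Python sketch of the RMOT interface implied by the abstract: a video sequence and one language expression go in, and an arbitrary number of referred object tracks (zero, one, or many) come out, produced online frame by frame. The model object, its step() method, and all data-structure names are illustrative assumptions, not the released TransRMOT API.

```python
# Minimal sketch (not the authors' released code) of the RMOT task interface:
# one language expression plus a video in, an arbitrary number of referred
# object tracks out. The model, its .step() method, and all field names are
# illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class ReferredTrack:
    track_id: int  # identity kept consistent across frames
    # each entry: (frame_index, x1, y1, x2, y2)
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)


@dataclass
class RMOTSample:
    frames: List[np.ndarray]   # T RGB frames of one Refer-KITTI sequence
    expression: str            # e.g. "the cars which are turning left"


def run_online_rmot(model, sample: RMOTSample) -> List[ReferredTrack]:
    """Process a sequence frame by frame, as an online tracker would.

    `model.step(frame, expression)` is assumed to return (track_id, box) pairs
    for the objects matching the expression in the current frame; the model is
    expected to keep its own memory of previously seen frames.
    """
    tracks: Dict[int, ReferredTrack] = {}
    for t, frame in enumerate(sample.frames):
        for track_id, box in model.step(frame, sample.expression):
            tracks.setdefault(track_id, ReferredTrack(track_id))
            tracks[track_id].boxes.append((t, *box))
    return list(tracks.values())
```

Note that, unlike single-object referring tasks, the output is a set of tracks whose size depends on the expression, which is what distinguishes RMOT from prior referring understanding settings.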

Keywords:
Computer vision, artificial intelligence, machine learning, natural language processing, object detection, video tracking, transformer, referent, expression, benchmark, scalability, segmentation

Metrics

Cited by: 68
FWCI (Field Weighted Citation Impact): 12.37
References: 84
Citation Normalized Percentile: 0.99 (top 1%)

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Speech and Dialogue Systems (Physical Sciences → Computer Science → Artificial Intelligence)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

JOURNAL ARTICLE

Cross-View Referring Multi-Object Tracking
Sijia Chen, En Yu, Wenbing Tao
Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39 (2), pp. 2204-2211

JOURNAL ARTICLE

EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving
Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, Kailun Yang
IEEE Transactions on Intelligent Transportation Systems, 2024, Vol. 25 (11), pp. 18964-18977