Abstract

We propose a conceptually simple, and therefore fast, multi-object tracking (MOT) model that requires no additional modules such as the Kalman filter, the Hungarian algorithm, transformer blocks, or graph networks. Conventional MOT models are built from such multi-step modules, which makes their computational cost high. Our proposed end-to-end MOT model, TicrossNet, consists only of a base detector and a cross-attention module. As a result, the tracking overhead does not increase significantly even as the number of instances (N_t) grows. We show that TicrossNet runs in real time: it achieves 32.6 FPS on MOT17 and 31.0 FPS on MOT20 (Tesla V100), which contains more than 100 instances per frame. We also demonstrate that TicrossNet is robust to N_t. Many other models shrink or enlarge their base detector according to N_t to keep processing in real time, but TicrossNet does not need to change the size of its base detector.
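
To illustrate why attention-based association can avoid a separate matching step, here is a minimal sketch. It is not the TicrossNet implementation: the abstract does not specify how the cross-attention module is wired. We assume that each detection in frame t carries an embedding (the query), that each detection in frame t-1 carries an embedding (the key) and a track ID, and that a detection whose strongest attention weight falls below a threshold starts a new track. The function name `cross_attention_associate` and the `min_prob` threshold are hypothetical. The association cost is a single N_t × N_{t-1} score computation followed by a softmax and an argmax per row. By contrast, the Hungarian algorithm runs in cubic time.

```python
import math
from typing import List, Optional


def cross_attention_associate(
    curr: List[List[float]],
    prev: List[List[float]],
    prev_ids: List[int],
    min_prob: float = 0.5,
) -> List[Optional[int]]:
    """Hypothetical sketch: give each current detection the ID of the previous
    detection it attends to most. Returns None (treated as a new track) when the
    largest attention weight is below min_prob."""
    if not prev:
        # No previous detections: every current detection starts a new track.
        return [None] * len(curr)
    d = len(curr[0]) if curr else 0
    scale = 1.0 / math.sqrt(d) if d else 1.0  # scaled dot-product attention
    ids: List[Optional[int]] = []
    for q in curr:
        # Attention scores between this query and every previous key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in prev]
        # Numerically stable softmax over the previous instances.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Greedy argmax per row instead of a global one-to-one assignment.
        best = max(range(len(prev)), key=probs.__getitem__)
        ids.append(prev_ids[best] if probs[best] >= min_prob else None)
    return ids


prev = [[1.0, 0.0], [0.0, 1.0]]   # embeddings of tracks 7 and 9 in frame t-1
curr = [[0.0, 5.0], [5.0, 0.0]]   # embeddings of detections in frame t
print(cross_attention_associate(curr, prev, [7, 9]))  # [9, 7]
```

The per-row argmax is a deliberate simplification. Unlike the Hungarian algorithm, it does not enforce a one-to-one matching, so two current detections could receive the same ID. A learned model would be expected to produce sharp enough attention weights to make such collisions rare.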
