JOURNAL ARTICLE

Sequential Transformer for End-to-End Video Text Detection

Abstract

In existing video text detection methods, the detection and tracking branches are usually independent of each other, and although they jointly optimize the backbone network, the tracking-by-detection paradigm is still required at inference time. To address this issue, we propose a novel video text detection framework based on a sequential transformer, which decodes the detection and tracking tasks in parallel, without explicitly setting up a tracking branch. To achieve this, we first introduce the concept of an instance query, which learns long-term context information in the video sequence. Then, based on the instance query, the transformer decoder predicts the entire box and mask sequence of a text instance in one pass. As a result, the tracking task is realized naturally. In addition, the proposed method can be applied to the scene text detection task seamlessly, without modifying any modules. To the best of our knowledge, this is the first framework to unify the tasks of scene text detection and video text detection. Our model achieves state-of-the-art performance on four video text datasets (YVT, RT-1K, BOVText, and BiRViT-1K), and competitive results on three scene text datasets (CTW1500, MSRA-TD500, and Total-Text). The code is available at https://github.com/zjb-1/SeqVideoText.
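
To make the "tracking is realized naturally" claim concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical decoder output in which each instance query yields one box and one confidence score per frame; the names `QueryOutput`, `tracks_from_queries`, and `score_thresh`, the box format, and all numbers are illustrative assumptions and do not come from the paper or its repository. Under that assumption, the query index itself serves as the track ID, so no frame-to-frame association step is needed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class QueryOutput:
    """Assumed decoder output for ONE instance query over a T-frame clip."""
    boxes: List[Box]      # one predicted box per frame
    scores: List[float]   # per-frame text confidence in [0, 1]


def tracks_from_queries(outputs: List[QueryOutput],
                        score_thresh: float = 0.5) -> Dict[int, Dict[int, Box]]:
    """Convert per-query box sequences into tracks.

    The query index doubles as the track ID, so no matching between
    frames (such as IoU-based association) is performed.
    Returns {track_id: {frame_index: box}}, keeping only frames whose
    score reaches the threshold.
    """
    tracks: Dict[int, Dict[int, Box]] = {}
    for query_id, out in enumerate(outputs):
        visible = {
            t: box
            for t, (box, score) in enumerate(zip(out.boxes, out.scores))
            if score >= score_thresh
        }
        if visible:  # skip queries that never fire (no text instance)
            tracks[query_id] = visible
    return tracks


# Toy 3-frame clip with 3 instance queries (made-up values).
outputs = [
    QueryOutput(boxes=[(10, 10, 50, 20), (12, 10, 52, 20), (14, 10, 54, 20)],
                scores=[0.9, 0.8, 0.95]),
    # A query that stays below the threshold in every frame.
    QueryOutput(boxes=[(0, 0, 1, 1)] * 3, scores=[0.1, 0.05, 0.2]),
    # A query whose instance is not confidently detected in frame 1.
    QueryOutput(boxes=[(80, 40, 120, 55), (81, 40, 121, 55), (82, 40, 122, 55)],
                scores=[0.7, 0.3, 0.6]),
]
tracks = tracks_from_queries(outputs)
```

In a tracking-by-detection pipeline, by contrast, boxes are detected per frame and then linked across frames in a separate association step; in this sketch the link is implicit in the query index.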

Keywords:
End-to-end principle, Computer science, Transformer, Speech recognition, Artificial intelligence, Electrical engineering, Engineering, Voltage

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 1.06
Refs: 63
Citation Normalized Percentile: 0.65

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Steganography and Watermarking Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
