Sequential Transformer for End-to-End Video Text Detection

Junbo Zhang; Mengbiao Zhao; Fei Yin; Cheng‐Lin Liu

doi:10.1109/wacv57701.2024.00639

JOURNAL ARTICLE

Year: 2024 Pages: 6506-6516

Abstract

In existing video text detection methods, the detection and tracking branches are usually independent of each other; although they jointly optimize the backbone network, the tracking-by-detection paradigm is still required at inference time. To address this issue, we propose a novel video text detection framework based on a sequential transformer, which decodes the detection and tracking tasks in parallel without explicitly setting up a tracking branch. To achieve this, we first introduce the concept of an instance query, which learns long-term context information in the video sequence. Then, based on the instance query, the transformer decoder predicts the entire box and mask sequence of a text instance in one pass. As a result, the tracking task is realized naturally. In addition, the proposed method can be applied to the scene text detection task seamlessly, without modifying any modules. To the best of our knowledge, this is the first framework to unify the tasks of scene text detection and video text detection. Our model achieves state-of-the-art performance on four video text datasets (YVT, RT-1K, BOVText, and BiRViT-1K) and competitive results on three scene text datasets (CTW1500, MSRA-TD500, and Total-Text). The code is available at https://github.com/zjb-1/SeqVideoText.
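
The idea that tracking "is realized naturally" can be illustrated with a small sketch. It is not the authors' implementation (see the repository linked above for that). It assumes, purely for illustration, that each instance query yields one box and one presence score per frame; the names `QueryOutput`, `tracks_from_queries`, and `threshold` are hypothetical, and masks are omitted. Under these assumptions the query index itself serves as the track ID, so no frame-to-frame association step is needed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class QueryOutput:
    """Hypothetical decoder output for one instance query over T frames."""
    boxes: List[Box]     # one predicted box per frame
    scores: List[float]  # per-frame confidence that the text is present


def tracks_from_queries(
    outputs: List[QueryOutput], threshold: float = 0.5
) -> Dict[int, Dict[int, Box]]:
    """Read trajectories directly from per-query sequences.

    The query index is used as the track ID, so there is no
    IoU or appearance matching between consecutive frames.
    """
    tracks: Dict[int, Dict[int, Box]] = {}
    for track_id, out in enumerate(outputs):
        # Keep only frames where this query reports a visible instance.
        frames = {
            t: box
            for t, (box, score) in enumerate(zip(out.boxes, out.scores))
            if score >= threshold
        }
        if frames:  # drop queries that never fire
            tracks[track_id] = frames
    return tracks


# Toy example: two queries over three frames (values are made up).
outputs = [
    QueryOutput(
        boxes=[(10, 10, 50, 20), (12, 10, 52, 20), (14, 10, 54, 20)],
        scores=[0.9, 0.8, 0.2],
    ),
    QueryOutput(boxes=[(0, 0, 1, 1)] * 3, scores=[0.1, 0.1, 0.1]),
]
tracks = tracks_from_queries(outputs)
```

In this toy input, the first query is kept for frames 0 and 1 and the second query is discarded. A tracking-by-detection pipeline would instead detect boxes per frame and then link them across frames in a separate matching step.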

Keywords: End-to-end principle; Computer science; Transformer; Speech recognition; Artificial intelligence; Electrical engineering; Engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 1.06

Citation Normalized Percentile: 0.65

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Steganography and Watermarking Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

End-to-End Video Text Spotting with Transformer

End-to-End Video Violence Detection with Transformer

End-to-end video text detection with online tracking

VSTRD: An end-to-end video spatio-temporal relation detection transformer

Unsupervised End-to-End Transformer based approach for Video Anomaly Detection