Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng
Abstract: Scene text in nature exhibits varied colors, a distinguishing feature that helps suppress background interference. In this study, color clustering is used as a prior to group patches, enhancing their spatial relationships. In addition, patch sizes are adjusted adaptively during training to balance speed and accuracy, while unimportant tokens and blocks are pruned from the model. We propose APViT, which adapts the Vision Transformer (ViT) to the requirements of scene text recognition. It consists of three components: Sparse Patches Selection (SPS), ViT-STR, and Token Code (TC). First, SPS segments images into appropriately sized patches and clusters similar ones to adaptively explore diverse local patches. Second, ViT-STR enhances the ViT model specifically for scene text recognition. Finally, TC prunes non-essential parts of the network based on self-attention to accelerate inference. Across several benchmark datasets, the proposed APViT outperforms state-of-the-art methods, demonstrating its effectiveness.
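The abstract does not specify how SPS groups patches by color. A minimal illustrative sketch, assuming each patch is represented by its mean RGB value and grouped with a small k-means-style clustering (the function name, patch size, and farthest-point initialization are assumptions for illustration, not the authors' method):

```python
import numpy as np

def cluster_patches_by_color(image, patch=8, k=3, iters=10):
    """Group image patches by mean color with a tiny k-means sketch.

    image: (H, W, 3) float array; H and W must be divisible by `patch`.
    Returns a (H // patch, W // patch) grid of cluster labels.
    """
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    # Mean RGB per patch -> (gh * gw, 3) feature matrix
    feats = image.reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3)).reshape(-1, 3)

    # Deterministic farthest-point initialization of the k color centers
    centers = [feats[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(feats[:, None] - np.array(centers)[None], axis=2).min(axis=1)
        centers.append(feats[d.argmax()])
    centers = np.array(centers)

    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        # Assign each patch to its nearest color center, then update centers
        d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = feats[labels == c].mean(axis=0)
    return labels.reshape(gh, gw)
```

Patches that land in the same cluster share similar colors, so text strokes (usually one dominant color) tend to group together while background patches fall into other clusters — the kind of spatial grouping the abstract describes.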