Yunfei Mu, Mieradilijiang Maimaiti, Miaomiao Xu, Wenkai Li, Wushour Silamu
Scene text recognition has significant application value in autonomous driving, smart retail, and assistive devices. However, owing to challenges such as multi-scale variation, distortion, and complex backgrounds, existing methods such as CRNN, ViT, and PARSeq, despite their strong performance, still leave room for improvement in feature extraction and semantic modeling. To address these issues, this paper proposes a novel scene text recognition model, the Encoder–Decoder Interactive Model (EDIM). Built on an encoder–decoder framework, EDIM introduces a Multi-scale Dilated Fusion Attention (MSFA) module in the encoder to enhance multi-scale feature representation, and a Sequential Encoder–Decoder Context Fusion (SeqEDCF) mechanism in the decoder to enable efficient semantic interaction between the encoder and decoder. The effectiveness of the proposed method is validated on six regular and irregular benchmark test sets as well as several subsets of the Union14M-L dataset. Experimental results demonstrate that EDIM outperforms state-of-the-art (SOTA) methods on multiple metrics, with especially notable gains in recognizing irregular and distorted text.
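The abstract does not specify how the multi-scale dilated fusion is implemented; the following is only a minimal, hypothetical numpy sketch of the general idea behind such a module — several dilated-convolution branches capturing different receptive fields, fused by softmax attention weights. The function names (`dilated_conv1d`, `msfa_fuse`) and the branch-scoring scheme are illustrative assumptions, not the paper's actual MSFA design.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated cross-correlation with zero padding (same-length output).
    Illustrative stand-in for one multi-scale branch; not the paper's layer."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def msfa_fuse(x, kernels, dilations):
    """Fuse several dilated branches with softmax attention over branches.
    Assumed scoring: each branch is weighted by its mean activation."""
    branches = np.stack([
        dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)
    ])                                  # shape: (num_branches, len(x))
    scores = branches.mean(axis=1)      # one scalar score per branch
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return (w[:, None] * branches).sum(axis=0)

# Usage: fuse three averaging branches with dilations 1, 2, 3
x = np.arange(8.0)
avg = np.array([1.0, 1.0, 1.0]) / 3.0
fused = msfa_fuse(x, [avg, avg, avg], [1, 2, 3])
```

Larger dilations widen the receptive field without adding parameters, which is one common way to handle the multi-scale text sizes mentioned above; the attention weights let the fused feature favor whichever scale responds most strongly.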