Xinjian Gao, Ye Pang, Yuyu Liu, Jun Yu, Maokun Han, Kai Hou, Wei Wang
Scene text recognition, especially irregular text recognition, is a challenging task due to the large variance in text appearance. Although some existing methods have achieved state-of-the-art performance with the attention-based encoder-decoder framework, they often perform poorly on challenging text, such as severely curved, blurred, or semantically incomplete text. To address these issues, we propose a Dual-Branch Cross-Attention Network (DBCAN). Unlike previous methods, which rely heavily on semantic information, DBCAN enhances positional clues and learns semantic relations in two separate branches, then fuses them with a tailored Cross-Attention Module (CAM). Furthermore, a Convolution-Based 2D Positional Embedding (CBPE) is introduced to describe the 2D spatial dependencies of characters. Extensive experiments demonstrate that DBCAN is more accurate and robust than previous methods and achieves state-of-the-art performance on several benchmarks, particularly CUTE (93.4%). Our code is publicly available at https://github.com/GaoXinJian-USTC/DBCAN.
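As a rough illustration of fusing two feature branches by cross-attention, the sketch below applies generic scaled dot-product attention, with queries taken from one branch and keys and values from the other. It is not the CAM described in the abstract: the abstract does not say which branch supplies the queries, so using position features as queries is an assumption, and the names `cross_attention`, `position_feats`, and `semantic_feats` and the toy values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention across two feature sets.

    queries: list of n_q vectors of dimension d (one branch)
    keys:    list of n_k vectors of dimension d (the other branch)
    values:  list of n_k vectors of dimension d_v (the other branch)
    Returns (outputs, weights); outputs holds n_q vectors of dimension d_v.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 position-branch vectors, 3 semantic-branch vectors.
# Assumption: position features act as queries, semantic features as keys/values.
position_feats = [[1.0, 0.0], [0.0, 1.0]]
semantic_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_attention(position_feats, semantic_feats, semantic_feats)
```

Each row of `attn` sums to 1, and each row of `fused` is a convex combination of the semantic vectors, weighted by their similarity to the corresponding position vector. A real model would add learned projections, multiple heads, and batching, all omitted here.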
Ronghua Jiang, Zhandong Liu, Ke Li, Lu Liang
Yijie Hu, Bin Dong, Kaizhu Huang, Lei Ding, Wei Wang, Xiaowei Huang, Qiufeng Wang