Local visual and long-range contextual features yield two complementary cues for human reading text in natural scene. Existing scene text recognition methods mainly extract local features at a low level and then model long-range dependencies at a high level, this sequential pipeline may be sub-optimal to construct complete and effective representation. Except for high-level features, long-range contextual relation is of importance in low-level features as well since it can help separate different characters based on the intervals between characters and thus enhance the character features. To address this issue, we develop a dual relation module to extract complementary features in a parallel manner for scene text recognition, which consists of a local visual branch and a long-range contextual branch. The local visual branch employs a topological-aware operation to model intra-character characteristic and extract discriminative features of different characters. Meanwhile, the long-range contextual branch utilizes a simple but effective strategy to incorporate inter-character relations into feature maps. Our dual relation module is a plug-and-play block which can be easily incorporated into modern deep architectures. Experimental results demonstrate that our methods achieved top performance on several standard benchmarks. Code and models will become publicly available in the future.
Junwei ZhouHongchao GaoJiao DaiDongqin LiuJizhong Han
Xinjian GaoYe PangYuyu LiuJun YuMaokun HanKai HouWei Wang
Shuo LiAnqi HanXu ChenXiaoqing YinJun Zhang
Yijie HuBin DongKaizhu HuangLei DingWei WangXiaowei HuangQiufeng Wang