Scene Text Recognition (STR) has long been considered an important yet challenging task in the field of computer vision. Recent works have demonstrated that leveraging language information is effective for visually difficult images, such as those with occlusion or blurring. However, the use of language information sometimes leads to an over-correction problem. On out-of-vocabulary samples (e.g., "hou" and "0x4a"), some methods become biased toward the language side and over-correct (e.g., rewriting "hou" as "hot"). This imbalance between vision and language limits the use of such models in practical scenarios, yet it rarely occurs for humans. To address this issue, we rethink the human recognition process and propose a model that behaves in the order of "Read, Spell and Repeat", refining the recognition result cyclically with both vision and language information. With this mechanism, our model integrates vision and language information more effectively, achieving higher accuracy with fewer parameters than the baseline and performance competitive with state-of-the-art methods on standard benchmarks.
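The "Read, Spell and Repeat" idea can be illustrated with a toy sketch: a language-side correction step that only touches characters the vision side is unsure about, applied iteratively. All names, the lexicon, and the confidence threshold below are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a "Read, Spell and Repeat" loop.
# The lexicon, threshold, and matching rule are toy assumptions.

VOCAB = {"hot", "house", "read"}  # toy lexicon standing in for the language model

def language_spell(word, conf, threshold=0.5):
    """Toy 'Spell' step: only correct when the vision side is unsure
    somewhere, so confident out-of-vocabulary strings like 'hou' or
    '0x4a' are left alone instead of being over-corrected."""
    if all(c >= threshold for c in conf):
        return word  # vision is confident at every position: keep as-is
    # Otherwise pick a lexicon word of the same length within edit distance 1.
    for cand in sorted(VOCAB):
        if len(cand) == len(word) and sum(a != b for a, b in zip(cand, word)) <= 1:
            return cand
    return word

def read_spell_repeat(word, conf, rounds=2):
    """Toy 'Repeat' step: re-apply the correction for a few rounds,
    taking the vision reading (word, per-char confidences) as input."""
    for _ in range(rounds):
        word = language_spell(word, conf)
    return word
```

For example, a confidently read "hou" stays "hou", while the same string with a low-confidence final character is corrected to the lexicon word "hot"; the real model performs this refinement with learned vision and language branches rather than a fixed lexicon.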
Shaocong Tian, Rize Jin, Joon-Young Paik, Chen Fu