Scene text recognition, a highly active research area in computer vision in recent years, remains challenging due to the large variance of irregular text. Current methods typically treat recognition as a sequence-to-sequence task and solve it with an encoder-decoder framework. In this work, we propose DMDAN for robust scene text recognition. First, we employ deformable convolution to strengthen the network's ability to adapt to irregular text. Then, mixed-domain visual attention and self-attention are applied in the encoder and decoder, respectively, which effectively alleviates the problem of "attention drift". Finally, we integrate the center loss to reduce intra-class distances and make each class easier to distinguish. Extensive experiments show that our model outperforms the CRNN baseline by a large margin and achieves performance comparable to existing attention-based methods on both regular and irregular datasets.
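The center loss mentioned above penalizes the distance between each feature vector and the center of its ground-truth class, pulling same-class features together. A minimal NumPy sketch of this term is given below; the variable names (`features`, `labels`, `centers`) and the toy data are illustrative assumptions, not the paper's actual training setup.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss: 0.5 * mean squared Euclidean distance between each
    feature vector and the center of its ground-truth class.
    features: (N, D) array, labels: (N,) int array, centers: (C, D) array."""
    diffs = features - centers[labels]          # (N, D) per-sample offsets
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Hypothetical toy example: 4 samples, 2-D features, 2 classes.
feats = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0], [-0.9, -0.1]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.9, 0.1], [-0.95, -0.05]])
loss = center_loss(feats, labels, centers)  # small, since features sit near their centers
```

In practice the class centers are learnable parameters updated alongside the network, and the center loss is added to the recognition loss with a weighting coefficient.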
Shuo Xu, Zeming Zhuang, Ming-Jun Li, Feng Su
Zhi Qiao, Xugong Qin, Yu Zhou, Fei Yang, Weiping Wang