Recognizing text in images of natural scenes is a challenging task and an active research topic in computer vision. Unlike the text handled by traditional optical character recognition (OCR), words in natural images often have irregular layouts (e.g., arbitrary orientation, blurring, perspective distortion), which makes them difficult to recognize. In this paper, we develop a novel method, consisting of a text correction component and a text recognition network, that is more robust to irregular text. The text correction component rectifies the text in an input image into a more "readable" form. The text recognition network is a more "location-aware" attention-based sequence learning model that takes the rectified image as input and recognizes the text. The entire network is trained jointly, using only images and word-level annotations. The standard Softmax loss function considers only the separability between classes and does not enforce compactness within classes. We therefore adopt a new loss function, based on the Softmax loss, that enables the model to learn more discriminative features, reduce misclassifications, and improve accuracy. Extensive experiments on seven standard benchmarks demonstrate that the proposed method achieves performance comparable to the state of the art.
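To make the contrast between the standard Softmax loss and a loss that also encourages within-class compactness concrete, the following is a minimal sketch. The abstract does not specify the form of the new loss, so the additive-margin variant shown here is a well-known stand-in chosen for illustration and is an assumption, not the method of this paper. The function names, the `scale` and `margin` values, and the random data are likewise hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Standard Softmax loss: mean negative log-probability of the true class."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def additive_margin_softmax(features, weights, labels, scale=30.0, margin=0.35):
    """Margin-based Softmax variant (illustrative stand-in, not the paper's loss).

    Features and class weights are L2-normalized so that the logits are
    cosine similarities. A margin is subtracted from the true-class cosine,
    so a sample must lie closer to its class direction to reach the same
    loss, which encourages tighter clusters within each class.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w  # shape: (batch, num_classes)
    cos_with_margin = cos.copy()
    cos_with_margin[np.arange(len(labels)), labels] -= margin
    return softmax_cross_entropy(scale * cos_with_margin, labels)


# Hypothetical data: 4 samples, 8-dimensional features, 5 character classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
weights = rng.normal(size=(8, 5))
labels = np.array([0, 2, 1, 4])

loss_without_margin = additive_margin_softmax(features, weights, labels, margin=0.0)
loss_with_margin = additive_margin_softmax(features, weights, labels, margin=0.35)
print(loss_without_margin, loss_with_margin)
```

With `margin=0.0` the function reduces to a Softmax loss on scaled cosine logits. Any positive margin yields a strictly larger loss on the same inputs, which is the extra pressure that pulls features of the same class together during training.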
Tianlong Ma, Xiangcheng Du, Yanlong Wang, Xiu-Tao Cui
Yan Wu, Jiaxin Fan, Renshuai Tao, Jiakai Wang, Haotong Qin, Aishan Liu, Xianglong Liu