Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling

Shancheng Fang; Hongtao Xie; Zheng-Jun Zha; Nannan Sun; Jianlong Tan; Yongdong Zhang

doi:10.1145/3240508.3240571

JOURNAL ARTICLE

Year: 2018; Pages: 248-256

Abstract

Recent dominant approaches to scene text recognition are mainly based on a convolutional neural network (CNN) and a recurrent neural network (RNN), where the CNN processes images and the RNN generates character sequences. In contrast to these methods, we propose an attention-based architecture that is built entirely on CNNs. The distinctive characteristics of our method are: (1) The method follows an encoder-decoder architecture, in which the encoder is a two-dimensional residual CNN and the decoder is a deep one-dimensional CNN. (2) An attention module that captures visual cues and a language module that models linguistic rules are designed as equal components of the decoder, so attention and language can be viewed as an ensemble that jointly boosts predictions. (3) Instead of using a single loss from the language side, multiple losses from attention and language are accumulated to train the network end to end. We conduct experiments on standard datasets for scene text recognition, including Street View Text, IIIT5K and the ICDAR datasets. The experimental results show that our CNN-based method achieves state-of-the-art performance on several benchmark datasets, even without the use of an RNN.
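
Points (2) and (3) of the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion rule or loss weighting, so the choices below are assumptions made for illustration only: the two branches are fused by summing their per-step log-probabilities, and the training loss is the unweighted sum of one cross-entropy term per branch. All function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(step_logits, targets):
    """Mean negative log-likelihood of the target class over all decoding steps."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(step_logits, targets)]
    return sum(nll) / len(nll)

def ensemble_predict(att_logits, lang_logits):
    """Assumption: fuse the two branches by summing per-step log-probabilities."""
    preds = []
    for a, l in zip(att_logits, lang_logits):
        fused = [x + y for x, y in zip(log_softmax(a), log_softmax(l))]
        preds.append(max(range(len(fused)), key=fused.__getitem__))
    return preds

def accumulated_loss(att_logits, lang_logits, targets):
    """Assumption: unweighted sum of one loss per branch, so that both the
    attention branch and the language branch get a direct training signal."""
    return cross_entropy(att_logits, targets) + cross_entropy(lang_logits, targets)

# Toy example: 2 decoding steps, 3 character classes (hypothetical values).
att = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.4]]   # attention-branch logits
lang = [[1.0, 0.9, 0.0], [0.0, 2.5, 0.1]]  # language-branch logits
targets = [0, 1]

print(ensemble_predict(att, lang))
print(accumulated_loss(att, lang, targets))
```

In this toy input the attention branch alone would pick class 2 at the second step, while the fused prediction picks class 1 because the language branch is confident there. That is the kind of joint correction the ensemble view describes, shown here under the assumed fusion rule.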

Keywords:

Computer science; Benchmark (surveying); Convolutional neural network; Artificial intelligence; Recurrent neural network; Encoder; Language model; Pattern recognition (psychology); Speech recognition; Deep learning; Natural language processing; Artificial neural network

Metrics

FWCI (Field Weighted Citation Impact): 7.07

Citation Normalized Percentile: 0.97

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Convolutional Attention Networks for Scene Text Recognition

Reading scene text with fully convolutional sequence modeling

Arbitrary-Shaped Scene Text Recognition with Deformable Ensemble Attention

An Attention-based Sequence Learning Model for Scene Text Recognition with Text Correction

FDTA: Fully Convolutional Scene Text Detection With Text Attention