This thesis addresses the challenging problem of cursive text, specifically Urdu text, detection and recognition in natural scene images. Developing robust text spotting systems for cursive languages such as Urdu is more complex and challenging than for non-cursive languages such as English, mainly due to the complexities of the language and the challenges associated with cursive text in natural images. Different machine learning and deep learning based methods have been proposed to detect and recognize Urdu text while accounting for variations in the text itself (font size, color, writing style, aspect ratio) and environmental factors (blur, occlusion, uneven lighting, etc.). Three deep learning approaches are proposed throughout this research work.

The first approach combines the multi-scale and multi-level features of a Convolutional Neural Network (CNN) to recognize isolated Urdu character images. A nearest-neighbor interpolation method is proposed to normalize the spatial dimensions of the feature maps. Up-sampling and element-wise addition operations aggregate the multi-scale features, which are then passed to a multi-level feature fusion network. Finally, the aggregated and multi-level features are combined to create a more powerful feature set. The proposed method improves Urdu character recognition accuracy when compared to sequential CNN and classical machine learning methods.

In the second approach, a segmentation-free method is proposed to recognize cropped Urdu word images. The problem is treated as a sequence-to-sequence recognition task in which three deep CNNs extract the relevant features. A recurrent neural network with a connectionist temporal classification (CTC) layer performs label sequencing without prior segmentation of the text into individual characters. To further improve accuracy, a novel VGG-16 network with shortcut connections was developed that outperforms the other methods.
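The multi-scale aggregation step described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: it assumes (H, W, C) feature maps and up-samples every map to the largest spatial size with nearest-neighbor interpolation before adding them element-wise.

```python
import numpy as np

def nearest_resize(fmap, out_h, out_w):
    """Nearest-neighbor interpolation of an (H, W, C) feature map."""
    h, w, _ = fmap.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return fmap[rows][:, cols]

def aggregate_multiscale(feature_maps):
    """Up-sample all maps to the largest spatial size, then add element-wise."""
    target_h = max(f.shape[0] for f in feature_maps)
    target_w = max(f.shape[1] for f in feature_maps)
    resized = [nearest_resize(f, target_h, target_w) for f in feature_maps]
    return np.sum(resized, axis=0)
```

The aggregated map keeps the channel count of the inputs, so it can be passed on to a subsequent fusion network unchanged; only the spatial dimensions are normalized.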
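The CTC layer mentioned above allows the recurrent network to emit a label sequence without character-level segmentation. A minimal sketch of standard best-path (greedy) CTC decoding, which collapses consecutive repeats and removes the blank symbol, could look like this (the blank index of 0 is an illustrative assumption):

```python
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """Best-path CTC decoding.

    logits: (T, num_classes) array of per-time-step class scores.
    Takes the argmax at each time step, collapses consecutive
    repeated labels, then drops the blank symbol.
    """
    best_path = logits.argmax(axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded
```

This collapsing rule is what lets the network produce one output per feature-map column while still recovering a shorter character sequence, so no prior segmentation of the word image is needed.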
In the last approach, a deep transfer learning method with a hybrid feature fusion scheme is proposed. The features of a deep VGG-16 network, pre-trained on ImageNet, are first added and then concatenated at different scales and depths. A 3 x 3 sliding window is applied to the multi-level concatenated feature maps, taking a 3 x 3 x C convolutional feature to make each prediction. For each prediction, a text proposal generation method generates fixed vertical anchors at 10 different heights, while keeping the width of each text proposal fixed at 16 pixels. A recurrent layer with a bi-directional gated recurrent unit (GRU) network, defined within the convolutional network, takes the convolutional feature map at each sliding-window location and generates text/non-text and y-coordinate prediction scores for each text proposal. In the proposed framework, the feature extraction, feature fusion and recurrent layers are trained simultaneously in an end-to-end fashion.

Given the unavailability of public datasets for Urdu text detection and recognition in natural images, three separate datasets have been created and made publicly available for research purposes. The first dataset consists of large-scale segmented character images of individual Urdu letters in their different shapes (isolated, initial, medial and final). The second dataset consists of cropped Urdu word images with their annotations and lexicon text files. In the last dataset, all text instances in whole natural scene images are annotated; a separate UTF-8 encoded text file stores the word-level bounding box coordinates for each image. This dataset contains several images with bi-lingual, tri-lingual and handwritten text written on walls or signboards. To the best of our knowledge, these are the first datasets to be published for Urdu text detection and recognition in natural scene images.
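The text proposal generation step above can be illustrated with a short sketch. The abstract fixes the proposal width at 16 pixels and uses 10 anchor heights; the specific minimum height and geometric growth factor below are illustrative assumptions, not values from the thesis.

```python
def generate_anchors(center_x, center_y, n_heights=10, width=16,
                     min_h=11, scale=1.4):
    """Generate fixed-width vertical anchors at one sliding-window location.

    Returns (x1, y1, x2, y2) boxes: all share the same 16-pixel width,
    only the height varies. min_h and scale are illustrative choices.
    """
    anchors = []
    h = float(min_h)
    for _ in range(n_heights):
        anchors.append((center_x - width / 2, center_y - h / 2,
                        center_x + width / 2, center_y + h / 2))
        h *= scale  # next anchor is taller by a constant factor
    return anchors
```

Because only the anchor height varies, the recurrent layer needs to predict just a text/non-text score and the y-coordinates for each proposal, which keeps the regression problem one-dimensional.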