Detecting scene text in natural images is a challenging task due to varying text scales, uneven lighting, blurring, and perspective distortion. Conventionally, text detection features are extracted from a single scale, which is insufficient since text regions vary greatly in size. To address this issue, we design a Feature Fusion Network that concatenates lower- and higher-level features to detect text regions of multiple scales. Moreover, since models pre-trained on general object datasets may incur a learning bias due to the difference between scene text and general objects, we extend DenseNet as the base network in the feature extraction stage, which allows training to start from scratch without pre-training. Our model is an end-to-end scene text detector that detects text regions of various scales without a pre-trained model. It achieves state-of-the-art results on the ICDAR 2013 and COCO-Text benchmarks.
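The core idea of the fusion stage, concatenating a coarser higher-level feature map with a finer lower-level one, can be sketched roughly in NumPy. This is only an illustration under assumed shapes and nearest-neighbor upsampling; the actual network layers and dimensions are not specified here:

```python
import numpy as np

def upsample_nn(feat, factor):
    # Nearest-neighbor upsampling of a (H, W, C) feature map along H and W.
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(low, high):
    # Upsample the coarser, higher-level map to the lower-level map's
    # spatial resolution, then concatenate along the channel axis.
    factor = low.shape[0] // high.shape[0]
    return np.concatenate([low, upsample_nn(high, factor)], axis=-1)

# Hypothetical feature maps from two stages of the base network.
low = np.random.rand(64, 64, 128)   # fine, lower-level features
high = np.random.rand(32, 32, 256)  # coarse, higher-level features
fused = fuse(low, high)
print(fused.shape)  # (64, 64, 384)
```

The fused map keeps the fine spatial resolution of the lower-level features while carrying the channels of both stages, which is what lets a single detection head respond to text at multiple scales.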
Zhen Zhu, Minghui Liao, Baoguang Shi, Xiang Bai