Mengjie Zhong, Xihan Wang, Lian-He Shao, Quanli Gao
ABSTRACT End-to-end scene text spotting has attracted considerable academic interest in recent years. However, due to complex environmental factors, text recognition remains a formidable challenge. In this paper, we introduce an end-to-end scene text spotting framework, referred to as DSNet. The framework comprises two principal modules: a text feature enhancement module (TFEM) that enhances text regions and a redundant feature suppression module (RFSM) that suppresses noise. Within the TFEM, we design multiple transformer layers for feature encoding, which extract and enhance the feature representation of text regions. Within the RFSM, we design a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU), which effectively suppress irrelevant information through a feature reconstruction process. The proposed framework jointly optimizes text features by operating the TFEM and RFSM in parallel. The fused features from both modules are then passed to the decoder, enabling precise text localization and robust character recognition. Extensive experiments demonstrate that our model achieves competitive performance in end-to-end scene text spotting, attaining an F-measure of 90.2% on ICDAR2015, closely approaching the state of the art (91.0%).
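The abstract describes two branches (TFEM and RFSM) operating in parallel on shared features, with their outputs fused before decoding. The following is a minimal NumPy sketch of that dataflow only; the branch internals (a linear-plus-ReLU stand-in for the transformer encoder, a channel mask stand-in for the SRU/CRU reconstruction) and the additive fusion are simplifying assumptions, not the paper's actual operators.

```python
import numpy as np

def tfem(features, w_enc):
    # Stand-in for the transformer-based text feature enhancement branch
    # (assumption: a single linear projection + ReLU replaces the encoder layers).
    return np.maximum(features @ w_enc, 0.0)

def rfsm(features, channel_mask):
    # Stand-in for redundant feature suppression: zero out channels deemed
    # irrelevant (assumption: a fixed binary mask replaces the SRU/CRU units).
    return features * channel_mask

def fuse_and_decode(features, w_enc, channel_mask, w_dec):
    enhanced = tfem(features, w_enc)            # enhancement branch
    suppressed = rfsm(features, channel_mask)   # suppression branch, run in parallel
    fused = enhanced + suppressed               # fusion by addition (assumption)
    return fused @ w_dec                        # decoder stand-in

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))             # 4 tokens, 8 feature channels
w_enc = rng.standard_normal((8, 8))
mask = (rng.random(8) > 0.5).astype(float)
w_dec = rng.standard_normal((8, 2))
out = fuse_and_decode(feats, w_enc, mask, w_dec)
print(out.shape)  # (4, 2)
```

The key structural point the sketch preserves is that both branches consume the same input features and only their outputs are combined, matching the abstract's claim that the modules operate in parallel rather than sequentially.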