Cross-Lingual Text Image Recognition via Multi-Hierarchy Cross-Modal Mimic

Zhuo Chen; Fei Yin; Qing Yang; Cheng‐Lin Liu

doi:10.1109/tmm.2022.3183386

JOURNAL ARTICLE

Year: 2022
Journal: IEEE Transactions on Multimedia
Vol: 25
Pages: 4830-4841
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Optical character recognition and machine translation are usually studied and applied separately. In this paper, we consider a new problem, cross-lingual text image recognition (CLTIR), that integrates the two tasks. The core of this problem is to recognize source-language text shown in images and translate it into the target language in an end-to-end manner. Traditional cascaded systems perform text image recognition and text translation sequentially, which can lead to error accumulation and parameter redundancy. To overcome these problems, we propose a multi-hierarchy cross-modal mimic (MHCMM) framework for end-to-end CLTIR, which can be trained with a massive bilingual text corpus and a small number of bilingually annotated text images. In this framework, a plug-in machine translation model is used as a teacher that guides the CLTIR model to learn representations compatible with both the image and text modalities. Via adversarial learning and attention mechanisms, the proposed mimic method integrates both global and local information in the semantic space. Experiments on a newly collected dataset demonstrate the superiority of the proposed framework: our method outperforms other pipelines while using fewer parameters. Additionally, the MHCMM framework can efficiently exploit a large-scale bilingual corpus to further improve performance. Visualization of the attention scores indicates that the proposed model reads text images in a fashion similar to the way the machine translation model reads text tokens.
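The teacher-student idea in the abstract can be illustrated with a small, generic feature-mimic loss. The sketch below is not the MHCMM implementation: the function names, the loss weights, and the feature shapes are assumptions made for illustration. In particular, the global term here is a plain distance between mean-pooled features, swapped in for the adversarial objective the abstract mentions, and the local term is a simple dot-product attention alignment followed by a mean squared error.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_mimic_loss(student_feats, teacher_feats):
    # Hypothetical local term: each teacher token feature attends over the
    # student (image) features, and the attended student vector is pulled
    # toward that teacher feature with a mean squared error.
    dim = len(teacher_feats[0])
    total = 0.0
    for t in teacher_feats:
        weights = softmax([dot(t, s) for s in student_feats])
        aligned = [
            sum(w * s[k] for w, s in zip(weights, student_feats))
            for k in range(dim)
        ]
        total += sq_dist(aligned, t) / dim
    return total / len(teacher_feats)

def global_mimic_loss(student_feats, teacher_feats):
    # Hypothetical global term: distance between mean-pooled features.
    # This stands in for the adversarial objective, which is not shown here.
    dim = len(teacher_feats[0])
    return sq_dist(mean_vector(student_feats), mean_vector(teacher_feats)) / dim

def mimic_loss(student_feats, teacher_feats, local_weight=1.0, global_weight=1.0):
    # Weighted sum of the two terms; the weights are illustrative assumptions.
    return (local_weight * local_mimic_loss(student_feats, teacher_feats)
            + global_weight * global_mimic_loss(student_feats, teacher_feats))

if __name__ == "__main__":
    teacher = [[1.0, 1.0]]               # one text-token feature
    student = [[0.0, 0.0], [0.0, 0.0]]   # two image-region features
    print(mimic_loss(student, teacher))
```

Because the attention step aligns a variable number of image features to each text token, the student and teacher sequences do not need to have the same length, which is the property that makes such a loss usable across modalities.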

Keywords: Computer science; Artificial intelligence; Machine translation; Redundancy (engineering); Natural language processing; Text corpus; Optical character recognition; Visualization; Language model; Pattern recognition (psychology); Speech recognition; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 2.10

Citation Normalized Percentile: 0.86

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Cross-Lingual Text Image Recognition via Multi-Task Sequence to Sequence Learning

CCIM: Cross-modal Cross-lingual Interactive Image Translation

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

Improving Cross-domain, Cross-lingual and Multi-modal Deception Detection

Multi-modal Correlated Centroid Space for Multi-lingual Cross-Modal Retrieval