Cross‐modal knowledge learning with scene text for fine‐grained image classification

Li Ting Xiong; Yingchi Mao; Zicheng Wang; Bingbing Nie; Chang Li

doi:10.1049/ipr2.13039


JOURNAL ARTICLE


Year: 2024; Journal: IET Image Processing; Vol: 18 (6); Pages: 1447-1459; Publisher: Institution of Engineering and Technology



Abstract

Scene text in natural images carries additional semantic information that can aid image classification. Existing methods do not fully consider deep understanding of the text or the relationship between visual and textual content, which makes it difficult to judge the semantic accuracy of the text and its relevance to the visual content. This paper proposes an image classification method based on Cross‐modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross‐modal scene text recognition, text semantic enhancement, and visual‐text feature alignment. In the first stage, multi‐attention is used to extract features layer by layer, and a self‐mask‐based iterative correction strategy is used to improve scene text recognition accuracy. In the second stage, knowledge features are extracted using external knowledge and fused with text features to enhance the semantic information of the text. In the third stage, CKLST aligns visual and text features across attention mechanisms with a similarity matrix, so that the correlation between images and text can be captured to improve image classification accuracy. On the Con‐Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than the other baselines on fine‐grained image classification, improving mAP over the best baseline by 3.54%, 5.37%, 3.28%, and 2.81%, respectively.
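The third stage can be illustrated with a minimal sketch. The abstract does not specify how the similarity matrix is built or used, so the code below is not the authors' implementation; it assumes cosine similarity between image-region and word features, a softmax over words, and concatenation as the fusion step. The function name `align_visual_text`, the `temperature` parameter, and the feature shapes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def align_visual_text(visual, text, temperature=0.1):
    """Hypothetical alignment step (an assumption, not the paper's method).

    visual: (R, d) array of image-region features.
    text:   (T, d) array of scene-text word features.
    Returns the (R, T) cosine-similarity matrix, the (R, T) attention
    weights, and the (R, 2d) fused features.
    """
    sim = l2_normalize(visual) @ l2_normalize(text).T  # cosine similarities
    attn = softmax(sim / temperature, axis=1)          # each region attends over words
    attended_text = attn @ text                        # (R, d) text summary per region
    fused = np.concatenate([visual, attended_text], axis=1)
    return sim, attn, fused

# Toy inputs: 4 image regions and 3 recognized words, 8-dimensional features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
text = rng.normal(size=(3, 8))
sim, attn, fused = align_visual_text(visual, text)
print(sim.shape, attn.shape, fused.shape)
```

In this sketch a lower `temperature` makes each region focus on its single most similar word, while a higher value spreads attention across all recognized words; a classifier would then operate on the fused features.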

Keywords:

Computer science; Artificial intelligence; Feature (linguistics); Similarity (geometry); Pattern recognition (psychology); Image (mathematics); Modal; Relevance (law); Information retrieval; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.02

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

Multi-modal Knowledge-Enhanced Fine-Grained Image Classification

Fine-grained Feature Assisted Cross-modal Image-text Retrieval

Cross-Modal Knowledge Distillation For Fine-Grained One-Shot Classification

Correction to: Multi-modal Knowledge-Enhanced Fine-Grained Image Classification

Knowledge Mining with Scene Text for Fine-Grained Recognition