JOURNAL ARTICLE

Bi-Modal Progressive Mask Attention for Fine-Grained Recognition

Kaitao Song, Xiu-Shen Wei, Xiangbo Shu, Renjie Song, Jianfeng Lu

Year: 2020 Journal: IEEE Transactions on Image Processing Vol: 29 Pages: 7006-7018 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Traditional fine-grained image recognition is required to distinguish different subordinate categories (e.g., bird species) based on the visual cues in raw images. Because inter-class variations are small and intra-class variations are large, capturing the subtle differences between these sub-categories is crucial but challenging for fine-grained recognition. Recently, aggregating the language modality has proved to be a successful technique for improving visual recognition in practice. In this paper, we introduce an end-to-end trainable Progressive Mask Attention (PMA) model for fine-grained recognition that leverages both visual and language modalities. Our Bi-Modal PMA model not only captures the most discriminative part in the visual modality stage by stage through a mask-based approach, but also explores out-of-visual-domain knowledge from the language modality in an interactional alignment paradigm. Specifically, at each stage, a self-attention module is proposed to attend to the key patch from images or text descriptions. In addition, a query-relational module is designed to capture the key words/phrases of texts and to bridge the connection between the two modalities. The bi-modal representations learned at multiple stages are then aggregated as the final features for recognition. Our Bi-Modal PMA model needs only raw images and raw text descriptions, without requiring bounding boxes/part annotations in images or key word annotations in texts. Through comprehensive experiments on fine-grained benchmark datasets, we demonstrate that the proposed method achieves superior performance over the competing baselines, in both the vision-and-language bi-modal setting and the single visual modality setting.
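
The stage-by-stage masking idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not specify how patches are scored or how stages are combined, so the dot-product scoring against a fixed `query` vector, the rule of masking the single highest-weighted patch after each stage, and the concatenation of stage features are all assumptions made for illustration. The function name `progressive_mask_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    finite = [s for s in scores if s != -math.inf]
    peak = max(finite)
    exps = [0.0 if s == -math.inf else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_mask_attention(patches, query, num_stages):
    """Illustrative sketch (assumed design, not the paper's model).

    At each stage: score every unmasked patch against `query`, turn the
    scores into attention weights, pool the patches with those weights,
    then mask the top-weighted patch so the next stage attends elsewhere.
    """
    if num_stages > len(patches):
        raise ValueError("num_stages cannot exceed the number of patches")

    dim = len(query)
    masked = [False] * len(patches)
    stage_features = []
    picked = []

    for _ in range(num_stages):
        # Assumption: dot-product relevance; masked patches score -inf.
        scores = [
            -math.inf if masked[i] else sum(p * q for p, q in zip(patch, query))
            for i, patch in enumerate(patches)
        ]
        weights = softmax(scores)

        # Attention-weighted pooling of the patch features for this stage.
        feature = [
            sum(w * patch[d] for w, patch in zip(weights, patches))
            for d in range(dim)
        ]
        stage_features.append(feature)

        # Mask the most attended patch before moving to the next stage.
        top = max(range(len(patches)), key=lambda i: weights[i])
        masked[top] = True
        picked.append(top)

    # Assumption: stage features are aggregated by simple concatenation.
    aggregated = [value for feature in stage_features for value in feature]
    return aggregated, picked


patches = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
query = [1.0, 0.5]
features, picked = progressive_mask_attention(patches, query, num_stages=2)
print(picked)  # prints [2, 0]
```

In this toy run the first stage attends mostly to patch 2, which is then masked, so the second stage shifts its attention to patch 0. A learned model would replace the fixed `query` with trained parameters and would apply the same pattern to text tokens as well as image patches.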

Keywords:
Computer science, Modality (human–computer interaction), Discriminative model, Artificial intelligence, Modal, Benchmark (surveying), Key (lock), Class (philosophy), Natural language processing, Modalities, Bounding overwatch, Pattern recognition (psychology), Speech recognition

Metrics

Cited By: 73
FWCI (Field Weighted Citation Impact): 4.51
Refs: 76
Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Progressive Training Enabled Fine-Grained Recognition

Bin Kang, Fan Wu, Xin Li, Quan Zhou

Journal: 2022 IEEE International Conference on Image Processing (ICIP) Year: 2022

JOURNAL ARTICLE

DCMA-Net: dual cross-modal attention for fine-grained few-shot recognition

Yan Zhou, Xiao Ren, Jianxun Li, Yin Yang, Haibin Zhou

Journal: Multimedia Tools and Applications Year: 2023 Vol: 83 (5) Pages: 14521-14537

JOURNAL ARTICLE

Bi-channel attention meta learning for few-shot fine-grained image recognition

Yao Wang, Yang Ji, Wei Chang Wang, Bailing Wang

Journal: Expert Systems with Applications Year: 2023 Vol: 242 Pages: 122741-122741