Bi-Modal Progressive Mask Attention for Fine-Grained Recognition

Kaitao Song; Xiu-Shen Wei; Xiangbo Shu; Renjie Song; Jianfeng Lu

doi:10.1109/tip.2020.2996736

JOURNAL ARTICLE

Year: 2020
Journal: IEEE Transactions on Image Processing
Vol: 29
Pages: 7006-7018
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Traditional fine-grained image recognition requires distinguishing different subordinate categories (e.g., bird species) based on the visual cues in raw images. Because inter-class variations are small and intra-class variations are large, capturing the subtle differences between these sub-categories is crucial but challenging. Recently, aggregating the language modality has proven to be a successful technique for improving visual recognition in practice. In this paper, we introduce an end-to-end trainable Progressive Mask Attention (PMA) model for fine-grained recognition that leverages both visual and language modalities. Our Bi-Modal PMA model not only captures the most discriminative parts in the visual modality stage by stage through a mask-based scheme, but also explores out-of-visual-domain knowledge from the language modality in an interactive alignment paradigm. Specifically, at each stage, a self-attention module attends to the key patches of images or text descriptions. In addition, a query-relational module is designed to capture the key words/phrases of texts and further bridge the two modalities. The bi-modal representations learned at multiple stages are then aggregated as the final features for recognition. Our Bi-Modal PMA model needs only raw images and raw text descriptions, and requires neither bounding box/part annotations in images nor key word annotations in texts. Comprehensive experiments on fine-grained benchmark datasets demonstrate that the proposed method outperforms competing baselines, both in the vision-and-language bi-modal setting and in the single visual modality setting.
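The stage-by-stage masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `progressive_mask_attention`, the fixed `query` vector, the dot-product scoring, and the rule of masking exactly one top-weighted patch per stage are all assumptions made for illustration. The sketch only shows the general pattern of attending over patches, hiding the most attended one, and concatenating the per-stage features.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight where mask[i] is False."""
    live = [s for s, m in zip(scores, mask) if m]
    top = max(live)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_mask_attention(features, query, num_stages):
    """Hypothetical sketch of progressive masking (not the paper's code).

    features: list of patch feature vectors (lists of floats)
    query:    assumed scoring vector standing in for learned attention
    Returns the concatenated per-stage features and the masked indices.
    """
    if num_stages > len(features):
        raise ValueError("num_stages cannot exceed the number of patches")
    dim = len(query)
    mask = [True] * len(features)  # True means the patch is still visible
    stage_outputs, picked = [], []
    for _ in range(num_stages):
        # Score each patch by its dot product with the query.
        scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
        weights = masked_softmax(scores, mask)
        # Attention-weighted sum of the visible patch features.
        attended = [
            sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)
        ]
        stage_outputs.append(attended)
        # Mask the most attended patch so the next stage looks elsewhere.
        top_idx = max(range(len(features)), key=lambda i: weights[i])
        picked.append(top_idx)
        mask[top_idx] = False
    # Aggregate the stages by concatenation (an assumed aggregation rule).
    final = [v for out in stage_outputs for v in out]
    return final, picked


features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.5, 0.5]]
query = [1.0, 0.0]
final, picked = progressive_mask_attention(features, query, num_stages=3)
print(picked)  # [2, 0, 3]
```

In this toy run each stage selects a different patch because the previously selected one is hidden, which is the behavior the mask-based scheme is meant to encourage. A real model would learn the scoring function and would operate on image patches and text tokens rather than on hand-written vectors.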

Keywords:

Computer science; Modality (human–computer interaction); Discriminative model; Artificial intelligence; Modal; Benchmark (surveying); Key (lock); Class (philosophy); Natural language processing; Modalities; Bounding overwatch; Pattern recognition (psychology); Speech recognition

Metrics

FWCI (Field Weighted Citation Impact): 4.51

Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Two-Level Progressive Attention Convolutional Network for Fine-Grained Image Recognition

Progressive Training Enabled Fine-Grained Recognition

DCMA-Net: dual cross-modal attention for fine-grained few-shot recognition

Bi-Channel Attention Meta Learning for Few-Shot Fine-Grained Image Recognition

Bi-channel attention meta learning for few-shot fine-grained image recognition