LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Zhao Yang; Jiaqi Wang; Yansong Tang; Kai Chen; Hengshuang Zhao; Philip Torr

doi:10.1109/cvpr52688.2022.01762

Year: 2022
Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages: 18134-18144

Abstract

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
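
The early-fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' module: it assumes a generic cross-attention step in which every visual token attends over the word features and the result is added back to the token after each encoder stage. The function names (`fuse_language_into_vision`, `encode_with_early_fusion`) and the placeholder stages are hypothetical, and a real implementation would use a tensor library rather than plain Python lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_language_into_vision(visual_tokens, word_features):
    """Hypothetical fusion step (an assumption, not the paper's exact design).

    visual_tokens: list of N vectors of length d, one per image position.
    word_features: list of T vectors of length d, one per word.
    Returns N vectors: each visual token plus an attention-weighted
    sum of the word features (a residual cross-attention update).
    """
    d = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for v in visual_tokens:
        # Attention weights of this image position over all words.
        weights = softmax([dot(v, w) * scale for w in word_features])
        # Language context vector for this position.
        context = [
            sum(a * w[i] for a, w in zip(weights, word_features))
            for i in range(d)
        ]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def encode_with_early_fusion(visual_tokens, word_features, stages):
    """Apply each encoder stage, then fuse language before the next stage."""
    x = visual_tokens
    for stage in stages:
        x = stage(x)
        x = fuse_language_into_vision(x, word_features)
    return x


# Usage with identity functions standing in for real Transformer stages.
identity_stage = lambda tokens: tokens
visual = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
out = encode_with_early_fusion(visual, words, [identity_stage, identity_stage])
```

Because fusion happens inside the encoder loop, later stages operate on features that already carry language information, which is the contrast the abstract draws with fusing only in a separate cross-modal decoder.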

Keywords: Computer science, Computer vision, Artificial intelligence, Image segmentation, Transformer, Segmentation, Engineering, Electrical engineering

Metrics

Cited By: 279

FWCI (Field Weighted Citation Impact): 18.99

Refs

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Language-Aware Vision Transformer for Referring Segmentation

Vision-Aware Language Reasoning for Referring Image Segmentation

LQMFormer: Language-Aware Query Mask Transformer for Referring Image Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation