JOURNAL ARTICLE

Language-Aware Vision Transformer for Referring Segmentation

Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

Year: 2024  Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence  Vol: 47 (7)  Pages: 5238-5255  Publisher: IEEE Computer Society

Abstract

Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. In this way, accurate segmentation results can be harvested with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a unified framework that can handle both image and video inputs with enhanced segmentation capability on the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
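
The core idea described above, injecting pixel-specific linguistic context into intermediate visual features of the encoder before the next stage, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name LanguageAwareFusion, the tensor layouts, and the gating scheme are simplifying assumptions chosen to show the dense (pixel-to-word) attention and gated residual fusion pattern only.

# Minimal sketch of early language-vision fusion (assumed names and layouts).
import torch
import torch.nn as nn

class LanguageAwareFusion(nn.Module):
    def __init__(self, vis_dim, lang_dim):
        super().__init__()
        self.query = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)   # pixel queries
        self.key = nn.Linear(lang_dim, vis_dim)                    # word keys
        self.value = nn.Linear(lang_dim, vis_dim)                  # word values
        self.gate = nn.Sequential(nn.Conv2d(vis_dim, vis_dim, 1), nn.Tanh())

    def forward(self, vis, words, word_mask):
        # vis:       (B, C, H, W) intermediate visual features from an encoder stage
        # words:     (B, L, D) word features from the language encoder
        # word_mask: (B, L), 1 for valid tokens, 0 for padding
        b, c, h, w = vis.shape
        q = self.query(vis).flatten(2).transpose(1, 2)             # (B, HW, C)
        k = self.key(words)                                        # (B, L, C)
        v = self.value(words)                                      # (B, L, C)
        attn = q @ k.transpose(1, 2) / c ** 0.5                    # (B, HW, L): each pixel attends over words
        attn = attn.masked_fill(word_mask[:, None, :] == 0, float('-inf'))
        attn = attn.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)       # pixel-specific linguistic context
        return vis + self.gate(ctx) * ctx                          # gated residual fusion into the visual stream

# Usage sketch: fuse word features into one encoder stage's output.
fusion = LanguageAwareFusion(vis_dim=256, lang_dim=768)
vis = torch.randn(2, 256, 30, 30)
words = torch.randn(2, 20, 768)
mask = torch.ones(2, 20)
out = fusion(vis, words, mask)   # (2, 256, 30, 30), passed on to the next encoder stage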

Keywords:
Computer science, Artificial intelligence, Computer vision, Segmentation, Image segmentation, Natural language processing, Transformer, Machine vision, Pattern recognition (psychology), Engineering

Metrics

Cited By: 7
FWCI (Field-Weighted Citation Impact): 3.71
Refs: 93
Citation Normalized Percentile: 0.89

Topics

Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Robotics and Automated Systems (Physical Sciences → Engineering → Control and Systems Engineering)

Related Documents

CONFERENCE PAPER

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip Torr

Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  Year: 2022  Pages: 18134-18144
JOURNAL ARTICLE

Vision-Aware Language Reasoning for Referring Image Segmentation

Fayou Xu, Bing Luo, Chao Zhang, Li Xu, Mingxing Pu, Bo Li

Journal: Neural Processing Letters  Year: 2023  Vol: 55 (8)  Pages: 11313-11331
CONFERENCE PAPER

Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

Published in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)  Year: 2021  Pages: 16301-16310
JOURNAL ARTICLE

VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence  Year: 2022  Vol: 45 (6)  Pages: 7900-7916