You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

Dezhuang Li; Ruoqi Li; Lijun Wang; Yifan Wang; Jinqing Qi; Lu Zhang; Ting Liu; Qingquan Xu; Huchuan Lu

doi:10.1609/aaai.v36i2.20017

JOURNAL ARTICLE

Year: 2022
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Vol: 36 (2)
Pages: 1297-1305
Publisher: Association for the Advancement of Artificial Intelligence

Abstract

We present YOFO (You Only inFer Once), a new paradigm for referring video object segmentation (RVOS) that operates in a one-stage manner. Our key insight is that the language descriptor should serve as target-specific guidance to identify the target object, while a direct feature fusion of image and language can increase feature complexity and thus may be sub-optimal for RVOS. To this end, we propose a meta-transfer module, which is trained in a learning-to-learn fashion and aims to transfer the target-specific information from the language domain to the image domain, while discarding the uncorrelated complex variations of language description. To bridge the gap between the image and language domains, we develop a multi-scale cross-modal feature mining block that aggregates all the essential features required by RVOS from both domains and generates regression labels for the meta-transfer module. The whole system can be trained in an end-to-end manner and shows competitive performance against state-of-the-art two-stage approaches.
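
The abstract does not spell out the internals of the meta-transfer module, so the following is only a toy sketch of the general idea it contrasts with direct fusion: the language embedding is turned into a target-specific filter that is then applied to image features, instead of being concatenated with them. Everything here is a hypothetical illustration, not the authors' implementation: the names `language_to_filter` and `segment`, the hand-set weights in `HEAD`, the bias, and the two-channel toy frame are all assumptions. The learning-to-learn training and the multi-scale cross-modal feature mining block are omitted.

```python
import math

def language_to_filter(lang_vec, weights):
    """Map a sentence embedding to one filter coefficient per image channel.

    `weights` is a hypothetical linear head (channels x embedding size).
    """
    return [sum(w * x for w, x in zip(row, lang_vec)) for row in weights]

def segment(feature_map, filt, bias=-2.0):
    """Correlate the filter with every pixel's channel vector (a 1x1 filter),
    apply a sigmoid, and threshold at 0.5 to get a binary mask."""
    mask = []
    for row in feature_map:
        mask_row = []
        for pixel in row:
            logit = sum(f * c for f, c in zip(filt, pixel)) + bias
            mask_row.append(1.0 / (1.0 + math.exp(-logit)) > 0.5)
        mask.append(mask_row)
    return mask

# Toy 2x3 frame with 2 channels (assumed features):
# channel 0 fires on "red" pixels, channel 1 on "blue" pixels.
FRAME = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
]

# Hand-set head for illustration only; a real system would learn it.
HEAD = [[4.0, -4.0], [-4.0, 4.0]]

# Toy one-hot "sentence embeddings" for two referring expressions.
red_mask = segment(FRAME, language_to_filter([1.0, 0.0], HEAD))
blue_mask = segment(FRAME, language_to_filter([0.0, 1.0], HEAD))
```

In this sketch the image features are computed once and stay untouched by the language input; only the small filter changes with the referring expression, which is one way to read "language as target-specific guidance".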

Keywords:

Computer science, Feature (linguistics), Artificial intelligence, Object (grammar), Modal, Transfer of learning, Domain (mathematical analysis), Block (permutation group theory), Segmentation, Pattern recognition (psychology), Bridge (graph theory), Key (lock), Transfer (computing), Computer vision, Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 2.90

Citation Normalized Percentile: 0.93

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Subtitles and Audiovisual Media

Social Sciences → Arts and Humanities → Language and Linguistics

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Cross-modal Object Decoding and Referring Expression Decoupling for Referring Video Object Segmentation

Cross-modal Spectral Fusion Model for Referring Video Object Segmentation

Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples

Semantic-Assisted Object Clustering for Multi-Modal Referring Video Segmentation