VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation

Yicong Hong; Qi Wu; Yuankai Qi; Cristian Rodríguez-Opazo; Stephen Jay Gould

doi:10.1109/cvpr46437.2021.00169

Year: 2021 Pages: 1643-1653

Abstract

Accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application for the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty adapting the BERT architecture to the partially observable Markov decision process present in VLN, requiring history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of solving navigation and referring expression tasks simultaneously.
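The recurrent idea in the abstract, a state that is carried across time steps and refreshed by attending over both language and visual inputs, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `recurrent_step`, `navigate`), the averaging update rule, and the use of attention weights over candidate views as action probabilities are all assumptions made for illustration, and a single dot-product attention stands in for a full BERT layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, tokens):
    """Scaled dot-product attention of one query vector over token vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    context = [
        sum(w * t[i] for w, t in zip(weights, tokens))
        for i in range(len(query))
    ]
    return context, weights


def recurrent_step(state, language_tokens, candidate_views):
    """One time step: refresh the state, then score the candidate views."""
    # The state attends jointly over language and visual tokens (cross-modal).
    context, _ = attend(state, language_tokens + candidate_views)
    # Hypothetical update rule (an assumption): blend old state and context.
    new_state = [0.5 * s + 0.5 * c for s, c in zip(state, context)]
    # Assumption: attention of the new state over the views acts as a policy.
    _, action_probs = attend(new_state, candidate_views)
    return new_state, action_probs


def navigate(initial_state, language_tokens, views_per_step):
    """Roll the state through an episode, picking the top-scoring view."""
    state = initial_state
    actions = []
    for views in views_per_step:
        state, probs = recurrent_step(state, language_tokens, views)
        actions.append(max(range(len(probs)), key=probs.__getitem__))
    return state, actions
```

Calling `navigate(initial_state, language_tokens, views_per_step)` with one list of candidate view vectors per time step returns the final state and the index of the chosen view at each step. The point of the sketch is only that the same state vector is fed back in at every step, so each decision depends on the history of earlier observations.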

Keywords: Computer science, Encoder, Transformer, Architecture, Artificial intelligence, Process (computing), Language model, Modal, Markov decision process, Markov process, Human–computer interaction, Computer vision, Programming language, Engineering

Metrics

Cited By: 204

FWCI (Field Weighted Citation Impact): 15.95

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

A Recurrent Vision-and-Language BERT for Navigation

Improved VLN-BERT with Reinforcing Endpoint Alignment for Vision-and-Language Navigation

Reinforced Vision-and-Language Navigation Based on Historical BERT

VLN-ChEnv: Vision-language Navigation in Changeable Environments