JOURNAL ARTICLE

ESceme: Vision-and-Language Navigation with Episodic Scene Memory

Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao

Year: 2024 Journal: International Journal of Computer Vision Vol: 133 (1) Pages: 254-274 Publisher: Springer Science+Business Media

Abstract

Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches, such as beam search, pre-exploration, and dynamic or hierarchical history encoding, have made enormous progress in navigating new environments. To balance generalization and efficiency, we memorize previously visited scenarios in addition to the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture when making its next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme that enhances the accessible views at each location and progressively completes the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. ESceme also wins first place on the CVDN leaderboard. Code is available at https://github.com/qizhust/esceme.
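
The idea of enhancing the views at a location with remembered observations, and filling in the memory as the agent moves, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `EpisodicSceneMemory`, its methods, the plain-list feature vectors, and the use of mean pooling over remembered neighbors are all assumptions made for illustration only.

```python
from collections import defaultdict


class EpisodicSceneMemory:
    """Hypothetical per-scene memory: scene id -> viewpoint id -> feature vector."""

    def __init__(self):
        self._scenes = defaultdict(dict)

    def update(self, scene_id, viewpoint_id, feature):
        """Store (or overwrite) the feature observed at a visited viewpoint."""
        self._scenes[scene_id][viewpoint_id] = list(feature)

    def enhance(self, scene_id, feature, neighbors):
        """Mean-pool the current feature with remembered neighbor features.

        Mean pooling is an assumed choice; neighbors that were never
        visited in this scene are simply skipped.
        """
        remembered_scene = self._scenes.get(scene_id, {})
        vectors = [list(feature)]
        vectors += [remembered_scene[n] for n in neighbors if n in remembered_scene]
        return [sum(values) / len(vectors) for values in zip(*vectors)]


# The memory is completed progressively as viewpoints are visited.
memory = EpisodicSceneMemory()
memory.update("scene_a", "v1", [1.0, 0.0])
memory.update("scene_a", "v2", [0.0, 1.0])

# "v9" has not been visited, so only "v1" and "v2" contribute.
enhanced = memory.enhance("scene_a", [1.0, 1.0], neighbors=["v1", "v2", "v9"])

# In a scene with no memory yet, the feature is returned unchanged.
unseen = memory.enhance("scene_b", [0.5, 0.25], neighbors=["v1"])
```

In this sketch the memory is keyed by scene, so observations from earlier episodes in the same scene remain available to later ones, while an unseen scene falls back to the current observation alone.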

Keywords:
Computer science, Memorization, Encoding (memory), Episodic memory, Generalization, Artificial intelligence, Code (set theory), Horizon, Human–computer interaction, Natural language, Dialog box, Computer vision, Programming language, Cognitive psychology, Psychology

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.53
Refs: 59
Citation Normalized Percentile: 0.54

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

Yuanlong Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2025 Vol: 39 (6) Pages: 6345-6353
JOURNAL ARTICLE

Episodic Transformer for Vision-and-Language Navigation

Alexander Pashevich, Cordelia Schmid, Chen Sun

Journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) Year: 2021
JOURNAL ARTICLE

Memory-Adaptive Vision-and-Language Navigation

Keji He, Ya Jing, Yan Huang, Zhihe Lu, Dong An, Liang Wang

Journal: Pattern Recognition Year: 2024 Vol: 153 Pages: 110511-110511
JOURNAL ARTICLE

Enhancing Vision and Language Navigation With Prompt-Based Scene Knowledge

Zhaohuan Zhan, Jinghui Qin, Wei Zhuo, Guang Tan

Journal: IEEE Transactions on Circuits and Systems for Video Technology Year: 2024 Vol: 34 (10) Pages: 9745-9756
JOURNAL ARTICLE

Indoor Scene Recognition in Vision-and-Language Navigation

Hongtao Zhang, Yuankai Qi, Mingbo Zhao, Yuping Liu

Journal: IEEE Transactions on Consumer Electronics Year: 2025 Pages: 1-1