Qiming Liu, Xinmin Du, Zhe Liu, Hesheng Wang
Visual navigation is fundamental for embodied agents operating in expansive workspaces. The cognitive abilities of these agents form the essential basis for intelligent behavioral patterns, and memory and reasoning are vital components among these abilities. The former enhances decision-making by preserving a wide array of episodic spatio-temporal perception cues, while the latter enables proactive probabilistic inference over task distributions based on long-term experience. Although each of these two cognitive modalities has been studied individually, integrating them for enhanced decision-making remains a considerable challenge due to their substantial differences in representation and behavioral characteristics. In this paper, we introduce the Semantic-based Multi-modal Cognitive Graph (SMCG) for intelligent visual navigation. This framework is distinguished by its unified semantic-level representation of both memory and reasoning capabilities. Specifically, SMCG records sequences of observed objects rather than directly memorizing perceptual features as previous methods do, while reasoning is grounded in a semantic relation graph that captures correlations among objects. We additionally develop a hierarchical cognition extraction (HCE) pipeline and employ it to decode cognitive cues from SMCG and situation-aware subgraphs, thereby enhancing intelligent navigation behavior. Experimental results on image-goal navigation show pronounced performance improvements, attributable to the effective induction and rational application of heterogeneous cognitive modalities.
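To make the abstract's core idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: a memory that records observed object sequences, a relation graph built from object co-occurrence as a stand-in for the semantic correlations SMCG learns, and a situation-aware subgraph query keyed on currently observed objects. All class and method names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


class SemanticGraphMemory:
    """Illustrative sketch of an SMCG-like structure (hypothetical API):
    episodic memory as object-label sequences plus a co-occurrence graph
    standing in for learned semantic relations."""

    def __init__(self):
        self.episodes = []                     # memory: object sequence per observation step
        self.edge_weight = defaultdict(float)  # reasoning: pairwise object correlations

    def observe(self, objects):
        """Record one step's detected object labels and update correlations."""
        self.episodes.append(list(objects))
        # Count each unordered pair of co-observed objects once per step.
        for a, b in combinations(sorted(set(objects)), 2):
            self.edge_weight[(a, b)] += 1.0

    def subgraph(self, context):
        """Situation-aware subgraph: edges touching currently observed objects."""
        ctx = set(context)
        return {edge: w for edge, w in self.edge_weight.items()
                if ctx & set(edge)}


mem = SemanticGraphMemory()
mem.observe(["sofa", "tv", "lamp"])
mem.observe(["tv", "lamp"])
mem.observe(["sink", "fridge"])
sg = mem.subgraph(["tv"])  # only living-room relations remain relevant
```

In this toy version, querying with `"tv"` keeps the `lamp`/`sofa` relations and drops the kitchen edge, mirroring how a situation-aware subgraph would restrict reasoning to the agent's current context.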