BOOK-CHAPTER

Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation

Abstract

Audio-visual embodied navigation enables robots to locate sound sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior work has explored basic fusion of visual and audio data, it often overlooks deeper perceptual context. To address this, we propose Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
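To make two of the reported metrics concrete, here is a minimal sketch of how success rate (SR) and success weighted by path length (SPL) are commonly computed over a set of evaluation episodes. It follows the standard SPL definition from the embodied-navigation literature and is not taken from the DMTF-AVN code; the `Episode` class, its field names, and the sample values are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical record layout)."""
    success: bool         # did the agent stop at the sounding target?
    shortest_path: float  # geodesic shortest-path length from start to goal
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path, e.shortest_path)
        # Guard against a degenerate zero-length episode.
        total += e.shortest_path / denom if denom > 0 else 1.0
    return total / len(episodes)


# Illustrative values only, not results from the chapter.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # twice as long
    Episode(success=False, shortest_path=10.0, agent_path=5.0),   # failure
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

SPL can never exceed SR, because each successful episode contributes at most 1. A large gap between the two suggests the agent reaches targets but takes inefficient routes.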

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Measurement and Detection Methods
Physical Sciences →  Engineering →  Electrical and Electronic Engineering

© 2026 ScienceGate Book Chapters — All rights reserved.