JOURNAL ARTICLE

Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

Abstract

Text-to-video retrieval systems have recently made significant progress by building on models pre-trained on large-scale image-text pairs. However, most recent methods focus on the video modality and disregard the audio signal. One exception is ECLIPSE, which improved long-range text-to-video retrieval by developing an audiovisual video representation. The objective of text-to-video retrieval, however, is to capture the complementary audio and video information that is pertinent to the text query, rather than simply to achieve better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. We demonstrate the method's efficacy on four benchmark datasets that include audio (MSR-VTT, LSMDC, VATEX, and Charades), where it consistently outperforms the state of the art. We attribute this to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
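
The idea of letting the text query attend to the audio and video features in two separate blocks can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`text_conditioned_pool`, `score`), the single-head attention without learned projections, the additive fusion of the two pooled vectors, and the cosine similarity score are all assumptions made for illustration, and the inputs are toy feature vectors rather than outputs of real encoders.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_vec, features):
    """Scaled dot-product attention with the text as the only query.

    `features` is a list of per-frame (or per-audio-segment) vectors that
    act as both keys and values. Returns the pooled vector and the weights.
    """
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, f) / scale for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def score(text_vec, video_frames, audio_segments):
    # Two independent attention blocks: the text attends to each modality
    # separately, so neither modality can suppress the other.
    video_repr, _ = text_conditioned_pool(text_vec, video_frames)
    audio_repr, _ = text_conditioned_pool(text_vec, audio_segments)
    # Assumption: fuse the two text-conditioned vectors by addition.
    fused = [v + a for v, a in zip(video_repr, audio_repr)]
    return cosine(text_vec, fused)


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    text = [random.gauss(0, 1) for _ in range(dim)]
    frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
    audio = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    print(score(text, frames, audio))
```

In a retrieval setting, a score of this kind would be computed for every candidate video and the candidates ranked by it. Because the pooled representations depend on the query, they must be recomputed for each text-video pair.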

Keywords:
Computer science; Task (project management); Benchmark (surveying); Modality (human–computer interaction); Video retrieval; Information retrieval; Feature (linguistics); Representation (politics); Artificial intelligence; Speech recognition

Metrics

Cited By: 20
FWCI (Field Weighted Citation Impact): 3.64
Refs: 43
Citation Normalized Percentile: 0.92

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Efficient Hierarchical Temporal Audio-Video Cross-Attention Fusion Network For Audio-Enhanced Text-To-Video Retrieval

R Rashmi, Chethan H.K.

Journal: Computer Engineering and Applications Journal Year: 2025 Vol: 14 (3) Pages: 211-238

JOURNAL ARTICLE

Video Retrieval Model Based on Video Text Alignment

宇 张

Journal: Journal of Image and Signal Processing Year: 2025 Vol: 14 (03) Pages: 349-361

JOURNAL ARTICLE

Video and text semantic center alignment for text-video cross-modal retrieval

Ming Jin, Huaxiang Zhang, Lei Zhu, Jiande Sun, Li Liu

Journal: Signal Processing Image Communication Year: 2025 Vol: 140 Pages: 117413-117413