JOURNAL ARTICLE

Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Fudong Nian, Ling Ding, Yuxia Hu, Yanhong Gu

Year: 2022   Journal: Mathematics   Vol: 10 (18)   Pages: 3346   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

This paper strives to improve the performance of video–text retrieval. To date, many algorithms have been proposed to refine the similarity measure for video–text retrieval, progressing from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the semantic levels they model are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space only through feature-distance measurement is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval, which jointly models video–text similarity at the global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action and relationship semantic levels by carefully designed spatial–temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-sharing attribute projection layer as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video–text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, MSR-VTT and VATEX, show the viability of our method.
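The abstract does not spell out the FBCE formula, but "focal binary cross-entropy" conventionally means binary cross-entropy with the focal-loss modulating factor of Lin et al., which down-weights easy examples so that rare attribute labels dominate the gradient less. A minimal sketch under that assumption (the `gamma` and `alpha` values are illustrative defaults, not taken from the paper):

```python
import math

def focal_bce(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal binary cross-entropy for one predicted probability p in (0, 1)
    and binary label y in {0, 1}.

    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    (easy) examples, which is the standard remedy for imbalanced label
    distributions; the exact FBCE form in the paper may differ.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp for numerical safety
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` and `alpha = 0.5` this reduces to half the ordinary binary cross-entropy; increasing `gamma` progressively suppresses the contribution of confidently correct predictions.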

Keywords:
Computer science, Semantic similarity, Information retrieval, Semantic feature, Semantics (computer science), Semantic compression, Semantic computing, Artificial intelligence, Feature (linguistics), Explicit semantic analysis, Semantic space, Similarity (geometry), Natural language processing, Semantic technology, Image (mathematics), Semantic Web

Metrics

Cited By: 3
FWCI (Field-Weighted Citation Impact): 0.37
Refs: 53
Citation Normalized Percentile: 0.55

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

JOURNAL ARTICLE

Video and text semantic center alignment for text-video cross-modal retrieval

Ming Jin, Huaxiang Zhang, Lei Zhu, Jiande Sun, Li Liu

Journal: Signal Processing: Image Communication   Year: 2025   Vol: 140   Pages: 117413
JOURNAL ARTICLE

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Jia Chen, Hong Zhang

Journal: Multimedia Tools and Applications   Year: 2024   Vol: 83 (40)   Pages: 88221-88243
JOURNAL ARTICLE

Multilevel Semantic Interaction Alignment for Video–Text Cross-Modal Retrieval

L. Chen, Zhen Deng, Libo Liu, Shibai Yin

Journal: IEEE Transactions on Circuits and Systems for Video Technology   Year: 2024   Vol: 34 (7)   Pages: 6559-6575
JOURNAL ARTICLE

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

Journal: IEEE Transactions on Multimedia   Year: 2022   Vol: 25   Pages: 7517-7532