JOURNAL ARTICLE

Depth-Aware Sparse Transformer for Video-Language Learning

Abstract

In video-language (VL) learning tasks, a large proportion of text annotations describe geometrical relationships between instances (e.g., 19.6% to 45.0% in MSVD, MSR-VTT, MSVD-QA, and MSRVTT-QA), and these annotations often become the bottleneck of current VL tasks (e.g., 60.8% vs. 98.2% CIDEr on MSVD for geometrical versus non-geometrical annotations). Given the rich spatial information in depth maps, an intuitive remedy is to enrich conventional 2D visual representations with depth information through current state-of-the-art models such as transformers. However, computing self-attention over long-range sequences of heterogeneous video-level representations is cumbersome in terms of both computational cost and flexibility across frame scales. To tackle this, we propose a hierarchical transformer, termed the Depth-Aware Sparse Transformer (DAST). Specifically, to guarantee computational efficiency, a depth-aware sparse attention module with linear computational complexity is designed for each transformer layer to learn depth-aware 2D representations. Furthermore, we design a hierarchical structure to maintain multi-scale temporal coherence across long-range dependencies. These qualities make DAST compatible with a broad range of video-language tasks, including video captioning (107.8% CIDEr on MSVD, 52.5% on MSR-VTT), video question answering (44.1% on MSVD-QA, 39.4% on MSRVTT-QA), and video-text matching (215.7 SumR on MSR-VTT). Our code is available at https://github.com/zchoi/DAST
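The abstract does not spell out how the depth-aware sparse attention achieves linear complexity, but the general idea of restricting attention by a depth cue can be sketched as follows. All names and the top-k-by-depth selection rule here are illustrative assumptions, not the paper's actual design: each token attends only to the k tokens whose depth values are closest to its own, so the attention cost is O(N·k) instead of O(N²) in sequence length N.

```python
import numpy as np

def depth_aware_sparse_attention(x, depth, k=4):
    """Illustrative depth-conditioned sparse attention (assumed design,
    not the paper's): each token attends only to the k tokens nearest
    to it in depth, so attention itself costs O(N*k) rather than O(N^2).

    x     : (N, d) token features
    depth : (N,)   per-token depth cue
    """
    n, d = x.shape
    # Pairwise depth distances; the full N x N matrix is built here only
    # for clarity -- a bucketed/binned depth selection would keep the
    # neighbor search linear as well.
    dist = np.abs(depth[:, None] - depth[None, :])   # (N, N)
    idx = np.argsort(dist, axis=1)[:, :k]            # (N, k) nearest in depth

    out = np.empty_like(x)
    for i in range(n):
        keys = x[idx[i]]                             # (k, d) selected tokens
        scores = keys @ x[i] / np.sqrt(d)            # scaled dot products, (k,)
        w = np.exp(scores - scores.max())            # stable softmax
        w /= w.sum()
        out[i] = w @ keys                            # convex combination of keys
    return out
```

Because each query mixes only k depth-similar tokens, tokens lying at similar scene depths (and hence likely in similar geometrical relationships) exchange information preferentially, which matches the motivation stated in the abstract.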

Keywords:
Computer science, Transformer, Modular design, Visualization, Bottleneck, Artificial intelligence, Embedded system

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 1.82
Refs: 35
Citation Normalized Percentile: 0.83

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Video Analysis and Summarization (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

JOURNAL ARTICLE

Locality-Aware Transformer for Video-Based Sign Language Translation

Zihui Guo, Yonghong Hou, Chunping Hou, Wenjie Yin

Journal: IEEE Signal Processing Letters, Year: 2023, Vol: 30, Pages: 364-368
JOURNAL ARTICLE

Learning Trajectory-Aware Transformer for Video Super-Resolution

Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2022, Pages: 5677-5686
BOOK-CHAPTER

Depth Estimation Using Sparse Depth and Transformer

Roopak Malik, Praful Hambarde, Subrahmanyam Murala

Series: Communications in Computer and Information Science, Year: 2022, Pages: 329-337
JOURNAL ARTICLE

Depth-Aware Video Abstraction

Jianbing Shen, Ying He

Year: 2010 Vol: 26 Pages: 475-480