Text-Visual Prompting for Efficient 2D Temporal Video Grounding

Yimeng Zhang; Xin Chen; Jinghan Jia; Sijia Liu; Ke Ding

doi:10.1109/cvpr52729.2023.01421

JOURNAL ARTICLE

Year: 2023 Pages: 14794-14804

Abstract

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting and ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demands intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and the language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves $5\times$ inference acceleration over TVG using 3D visual features. Code is available at Open.Intel.
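
The abstract names a Temporal-Distance IoU (TDIoU) loss but does not define it. As a rough illustration only, the sketch below computes the standard temporal IoU between a predicted and a ground-truth segment, then adds a hypothetical center-distance penalty in the spirit of Distance-IoU losses. The function names, the (start, end) tuple convention, and the penalty term are assumptions for illustration and should not be read as the paper's formulation.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time segments given as (start, end)."""
    ps, pe = pred
    gs, ge = gt
    # Overlap is zero when the segments are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def distance_iou_loss(pred, gt):
    """Hypothetical DIoU-style loss (an assumption, not the paper's TDIoU):
    1 - IoU + squared center distance normalized by the enclosing span."""
    ps, pe = pred
    gs, ge = gt
    enclosing = max(pe, ge) - min(ps, gs)
    center_gap = (ps + pe) / 2 - (gs + ge) / 2
    penalty = (center_gap / enclosing) ** 2 if enclosing > 0 else 0.0
    return 1.0 - temporal_iou(pred, gt) + penalty


if __name__ == "__main__":
    # Segments in seconds: prediction 2-6 s, ground truth 4-8 s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    print(distance_iou_loss((2.0, 6.0), (4.0, 8.0)))
```

Unlike plain 1 - IoU, which is constant for all disjoint segments, the distance term keeps changing as a non-overlapping prediction moves toward the target, which is the usual motivation for distance-aware IoU losses.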

Keywords:

Computer science; Encoder; Artificial intelligence; Convolutional neural network; Inference; Benchmark (surveying); Pattern recognition (psychology)

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 4.91

Refs: 118

Citation Normalized Percentile: 0.94

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Video-Text Prompting for Weakly Supervised Spatio-Temporal Video Grounding

Dynamic Multi-modal Prompting for Efficient Visual Grounding

Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding

Local-Global Video-Text Interactions for Temporal Grounding

Prompting Visual-Language Models for Efficient Video Understanding