Yaoyuan Liang, Xiao Liang, Yansong Tang, Zhao Yang, Ziran Li, Jingang Wang, Wenbo Ding, Shao-Lun Huang
This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.
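The core idea of the entangled decoder — a single query that couples temporal boundary prediction with per-frame spatial localization — can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: all shapes, weights, and head designs (`W_t`, `W_s`, the single-query attention) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16  # frames, feature dim (hypothetical sizes)
# Stand-in for fused per-frame video-text features from the encoder.
video_text_feats = rng.standard_normal((T, D))

# A single "entangled" query shared by both prediction heads (illustrative).
query = rng.standard_normal(D)

# Cross-attention: the query aggregates information across all frames.
scores = video_text_feats @ query / np.sqrt(D)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
entangled = attn @ video_text_feats  # (D,) event-level representation

# Temporal head: normalized (start, end) boundaries of the event.
W_t = rng.standard_normal((D, 2))  # hypothetical learned weights
start, end = 1.0 / (1.0 + np.exp(-(entangled @ W_t)))  # sigmoid to [0, 1]

# Spatial head: a per-frame box (cx, cy, w, h) conditioned on the SAME
# entangled query, so spatial and temporal predictions share one
# representation rather than being decoded independently.
W_s = rng.standard_normal((D, 4))  # hypothetical learned weights
boxes = 1.0 / (1.0 + np.exp(-((video_text_feats + entangled) @ W_s)))

print(boxes.shape)  # (8, 4): one normalized box per frame
```

The point of the sketch is only the coupling: because both heads read the same query, the predicted tube's boxes and its temporal extent are derived from one shared event representation, which is the property the paper argues improves cross-frame consistency.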