Yingqi Gao, Zhiling Luo, Shiqian Chen, Wei Zhou
In this paper, we consider a novel task, Video Corpus Spatio-Temporal Grounding (VCSTG), for material selection and spatio-temporal adaptation in intelligent video editing. Given a text query describing an object and a corpus of untrimmed, unsegmented videos, VCSTG aims to localize a sequence of spatio-temporal object tubes from the video corpus. Existing methods tackle VCSTG in a multi-stage manner, encoding the query and video representations independently for each stage, which leads to local optima. We propose a novel one-stage multi-task learning framework named MTSTG for the VCSTG task. MTSTG learns unified query and video representations for the video retrieval, temporal grounding and spatial grounding tasks. Video-level, frame-level and object-level contrastive learning are introduced to measure the mutual information between the query and the video at different granularities. Comprehensive experiments demonstrate that our newly proposed framework outperforms state-of-the-art multi-stage methods on the VidSTG dataset.
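The multi-granularity contrastive objective can be illustrated with a minimal sketch. The PyTorch snippet below assumes an InfoNCE-style loss between L2-normalized query embeddings and video-, frame- and object-level embeddings (pooled to one vector per sample); the function names, loss weights and temperature are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, candidate_emb, temperature=0.07):
    """InfoNCE loss: each query's positive is the candidate at the same
    batch index; all other candidates in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=-1)      # (B, D)
    c = F.normalize(candidate_emb, dim=-1)  # (B, D)
    logits = q @ c.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def multi_granularity_loss(q, video_emb, frame_emb, obj_emb,
                           weights=(1.0, 1.0, 1.0)):
    """Hypothetical sum of contrastive losses at video, frame and object
    granularity; frame/object embeddings are assumed pre-pooled."""
    return (weights[0] * info_nce(q, video_emb)
            + weights[1] * info_nce(q, frame_emb)
            + weights[2] * info_nce(q, obj_emb))

# Usage with random embeddings (batch of 8, dim 256):
B, D = 8, 256
q = torch.randn(B, D)
loss = multi_granularity_loss(q, torch.randn(B, D),
                              torch.randn(B, D), torch.randn(B, D))
```

Tying all three losses to a shared query embedding is what makes the representation unified across the retrieval, temporal and spatial tasks rather than being optimized per stage.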