DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-to-Fine Contrastive Ranking

Lijin Yang; Quan Kong; Hsuan-Kung Yang; Wadim Kehl; Yoichi Sato; Norimasa Kobori

doi:10.1109/cvpr52729.2023.02215

Year: 2023 Pages: 23130-23140

Abstract

Understanding dense action in videos is a fundamental challenge for the generalization of vision models. Several works show that compositionality is key to achieving generalization by combining known primitive elements, especially when handling novel composite structures. Compositional temporal grounding is the task of localizing dense actions described by novel query sentences that combine known words in new ways. Recent works assume that composition is learned from pairs of whole videos and language embeddings through large-scale self-supervised pre-training. Alternatively, one can decompose the video and language into word-level primitive elements and learn only fine-grained semantic correspondences. Neither approach considers the granularity of the compositions, where queries of different granularity correspond to different video segments. A good compositional representation should therefore be sensitive to the granularity of both video and query. We propose a method that learns a coarse-to-fine compositional representation by decomposing the original query sentence into different levels of granularity and then learning the correct correspondences between the video and the recombined queries through a contrastive ranking constraint. Additionally, we predict temporal boundaries in a coarse-to-fine manner for precise grounding. Experiments on two datasets, Charades-CG and ActivityNet-CG, show the superior compositional generalizability of our approach.
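To make the idea of a ranking constraint over query granularities more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the cosine similarity, the margin value, and the specific rule (a video segment should match a finer query better than a coarser one by at least a margin) are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_to_fine_ranking_loss(
    segment: Sequence[float],
    queries_coarse_to_fine: Sequence[Sequence[float]],
    margin: float = 0.1,  # hypothetical margin, not taken from the paper
) -> float:
    """Hinge-style ranking loss over adjacent granularity levels.

    Assumption for illustration: for a segment that matches the full query,
    similarity should increase from the coarsest query embedding to the
    finest one by at least `margin` at each step.
    """
    sims = [cosine(segment, q) for q in queries_coarse_to_fine]
    loss = 0.0
    for coarse_sim, fine_sim in zip(sims, sims[1:]):
        # Penalize each adjacent pair whose ordering violates the margin.
        loss += max(0.0, margin - (fine_sim - coarse_sim))
    return loss


# Toy embeddings: the segment aligns best with the finest query.
segment = [1.0, 0.0]
queries = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # coarse -> fine
well_ordered = coarse_to_fine_ranking_loss(segment, queries)
mis_ordered = coarse_to_fine_ranking_loss(segment, list(reversed(queries)))
```

In this toy setup the correctly ordered queries incur no penalty, while reversing the order produces a positive loss. A trained model would replace the hand-written vectors with learned video and query embeddings.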

Keywords:

Granularity; Computer science; Principle of compositionality; Ranking (information retrieval); Artificial intelligence; Generalization; Representation (politics); Natural language processing; Sentence; Task (project management); Constraint (computer-aided design); Process (computing); Flexibility (engineering); Boundary (topology); Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 2.37

Citation Normalized Percentile: 0.87

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary

SHINE: Saliency-Aware Hierarchical Negative Ranking for Compositional Temporal Grounding

Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

Coarse-to-Fine Contrastive Learning on Graphs