JOURNAL ARTICLE

DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-to-Fine Contrastive Ranking

Abstract

Understanding dense actions in videos is a fundamental challenge for the generalization of vision models. Several works show that compositionality, the ability to combine known primitive elements, is key to generalization, especially when handling novel composite structures. Compositional temporal grounding is the task of localizing dense actions described by novel query sentences that combine known words in new ways. Recent works assume that composition is learned from pairs of whole videos and language embeddings through large-scale self-supervised pre-training. Alternatively, one can decompose the video and language into word-level primitive elements and learn only fine-grained semantic correspondences. Neither approach considers the granularity of the compositions, where different query granularities correspond to different video segments. A good compositional representation should therefore be sensitive to different video and query granularities. We propose a method that learns a coarse-to-fine compositional representation by decomposing the original query sentence into different granularity levels and then learning the correct correspondences between the video and the recombined queries through a contrastive ranking constraint. Additionally, we perform temporal boundary prediction in a coarse-to-fine manner for precise grounding boundary detection. Experiments on two datasets, Charades-CG and ActivityNet-CG, show the superior compositional generalizability of our approach.
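The abstract does not give the form of the contrastive ranking constraint, so the following is only a minimal sketch of one possible reading: a pairwise margin (hinge) ranking loss over video-query matching scores. The function name `coarse_to_fine_ranking_loss`, the `margin` parameter, and the assumption that query variants arrive ordered from best-matching to worst-matching are all hypothetical and not taken from the paper.

```python
from itertools import combinations
from typing import Sequence


def coarse_to_fine_ranking_loss(scores: Sequence[float], margin: float = 0.1) -> float:
    """Average pairwise hinge loss over an ordered list of matching scores.

    Assumption (not from the paper): scores[i] is the video-query matching
    score of the i-th query variant, ordered from the variant expected to
    match the video best (index 0) to the one expected to match worst.
    """
    pairs = list(combinations(range(len(scores)), 2))
    if not pairs:
        # Fewer than two variants: there is nothing to rank.
        return 0.0
    total = 0.0
    for i, j in pairs:
        # Penalize pair (i, j) unless variant i outscores variant j by at least `margin`.
        total += max(0.0, margin - (scores[i] - scores[j]))
    return total / len(pairs)


# Hypothetical scores for the original query and two recombined variants.
well_ordered = coarse_to_fine_ranking_loss([0.9, 0.5, 0.1])
mis_ordered = coarse_to_fine_ranking_loss([0.1, 0.5, 0.9])
```

Under these assumptions the loss is zero when every better-matching variant outscores every worse-matching one by at least the margin, and it grows as the ordering is violated. A trained model would use differentiable scores from a video-language encoder rather than fixed numbers.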

Keywords:
Granularity; Computer science; Principle of compositionality; Ranking (information retrieval); Artificial intelligence; Generalization; Representation (politics); Natural language processing; Sentence; Task (project management); Constraint (computer-aided design); Process (computing); Flexibility (engineering); Boundary (topology); Mathematics

Metrics

Cited By: 13
FWCI (Field Weighted Citation Impact): 2.37
Refs: 53
Citation Normalized Percentile: 0.87

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary

Jiachang Hao, Haifeng Sun, Pengfei Ren, Yiming Zhong, Jingyu Wang, Qi Qi, Jianxin Liao

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications; Year: 2022; Vol: 19 (5); Pages: 1-21

JOURNAL ARTICLE

Coarse-to-Fine Contrastive Learning on Graphs

Peiyao Zhao, Yuangang Pan, Xin Li, Xu Chen, Ivor W. Tsang, Lejian Liao

Journal: IEEE Transactions on Neural Networks and Learning Systems; Year: 2023; Vol: 35 (4); Pages: 4622-4634