JOURNAL ARTICLE

Towards Global Video Scene Segmentation with Context-Aware Transformer

Yang Yang, Yurui Huang, Wei-Li Guo, Baohua Xu, Dingyin Xia

Year: 2023; Journal: Proceedings of the AAAI Conference on Artificial Intelligence; Vol: 37 (3); Pages: 3206-3213; Publisher: Association for the Advancement of Artificial Intelligence

Abstract

Long videos such as movies and TV episodes usually need to be divided into cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key challenge lies in finding scene boundaries while comprehensively considering the complex temporal structure and semantic information. To this end, we introduce a novel Context-Aware Transformer (CAT) with a self-supervised learning framework that learns high-quality shot representations for generating well-bounded scenes. More specifically, we design CAT with local-global self-attention, which effectively considers both long-term and short-term context to improve shot encoding. To train CAT, we adopt a self-supervised learning scheme. First, we leverage shot-to-scene level pretext tasks to facilitate pre-training with pseudo boundaries, which guides CAT to learn discriminative shot representations that maximize intra-scene similarity and inter-scene discrimination in an unsupervised manner. Then, we transfer the contextual representations and fine-tune CAT with supervised data, which encourages CAT to accurately detect boundaries for scene segmentation. As a result, CAT learns context-aware shot representations and provides global guidance for scene segmentation. Our empirical analyses show that CAT achieves state-of-the-art performance on the scene segmentation task on the MovieNet dataset, e.g., an improvement of 2.15 in AP.
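As a rough illustration of the pseudo-boundary idea mentioned in the abstract, the sketch below picks the split point in a short window of shot embeddings that maximizes intra-scene similarity while minimizing inter-scene similarity. This is a toy example under stated assumptions, not the method from the paper: the scoring rule (mean cosine similarity to each side's centroid, minus the similarity between the two centroids), the function names `cosine`, `mean_vector`, and `pseudo_boundary`, and the 2-D embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pseudo_boundary(shots):
    """Return (b, score) where shots[:b] and shots[b:] are the two
    pseudo-scenes with the highest intra-minus-inter similarity score.

    Assumption: this scoring rule is illustrative only.
    """
    best_b, best_score = None, -math.inf
    for b in range(1, len(shots)):
        left, right = shots[:b], shots[b:]
        c_left, c_right = mean_vector(left), mean_vector(right)
        # Intra-scene term: how close each shot is to its own centroid.
        intra = (
            sum(cosine(s, c_left) for s in left)
            + sum(cosine(s, c_right) for s in right)
        ) / len(shots)
        # Inter-scene term: how similar the two pseudo-scenes are.
        inter = cosine(c_left, c_right)
        score = intra - inter
        if score > best_score:
            best_b, best_score = b, score
    return best_b, best_score


# Hypothetical 2-D shot embeddings: three shots from one scene, two from another.
shots = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
boundary, score = pseudo_boundary(shots)
print(boundary)  # → 3
```

In a real pipeline the embeddings would come from a shot encoder, and the selected split would serve as a pseudo label for pre-training rather than as the final prediction.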

Keywords:
Computer science, Segmentation, Artificial intelligence, Leverage (statistics), Discriminative model, Computer vision, Transformer

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 0.80
Refs: 62
Citation Normalized Percentile: 0.67

Topics

Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, Yi Yang

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence; Year: 2023; Vol: 45 (8); Pages: 10055-10069
JOURNAL ARTICLE

SAFIT: Segmentation-Aware Scene Flow with Improved Transformer

Yukang Shi, Kaisheng Ma

Journal: 2022 International Conference on Robotics and Automation (ICRA); Year: 2022; Pages: 10648-10655
JOURNAL ARTICLE

Context-aware and local-aware fusion with transformer for medical image segmentation

Hanguang Xiao, Li Li, Qiyuan Liu, Qihang Zhang, Junqi Liu, Zhi Liu

Journal: Physics in Medicine and Biology; Year: 2023; Vol: 69 (2); Pages: 025011-025011