JOURNAL ARTICLE

Deformable VisTR: Spatio-Temporal Deformable Attention for Video Instance Segmentation

Sudhir Yarram, Jiong Wu, Pan Ji, Yi Xu, Junsong Yuan

Year: 2022  Published in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  Pages: 3303-3307

Abstract

The video instance segmentation (VIS) task requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR was proposed as an end-to-end transformer-based VIS framework and demonstrated state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points around a reference point. This makes the computation of Deformable VisTR linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR with 10× fewer GPU training hours. We validate the effectiveness of our method on the YouTube-VIS benchmark. Code is available at https://github.com/skrya/DefVIS.
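The core idea behind the abstract's efficiency claim can be sketched: each query attends only to K sampled points around a (t, y, x) reference point in the T×H×W feature volume, so per-query cost is O(K) rather than O(T·H·W). Below is a minimal, illustrative NumPy sketch (not the paper's implementation) using nearest-neighbor sampling; the function name is hypothetical, and the offsets and weights are passed in directly, whereas in the real model they would be predicted from the query by linear layers and sampling would be bilinear:

```python
import numpy as np

def st_deformable_attention(ref_point, feat, offsets, weights):
    """Single-query spatio-temporal deformable attention (illustrative sketch).

    ref_point: (3,) normalized (t, y, x) reference location in [0, 1]
    feat:      (T, H, W, C) spatio-temporal feature volume
    offsets:   (K, 3) normalized sampling offsets around the reference
    weights:   (K,) attention weights over the K sampling points (sum to 1)

    Returns the (C,) weighted sum of features at the K sampled points,
    i.e. O(K) work per query instead of attending to all T*H*W locations.
    """
    T, H, W, C = feat.shape
    out = np.zeros(C, dtype=feat.dtype)
    for (dt, dy, dx), w in zip(offsets, weights):
        # Map normalized (reference + offset) to an integer grid index,
        # clipped to the feature volume (nearest-neighbor for simplicity).
        t = min(max(int(round((ref_point[0] + dt) * (T - 1))), 0), T - 1)
        y = min(max(int(round((ref_point[1] + dy) * (H - 1))), 0), H - 1)
        x = min(max(int(round((ref_point[2] + dx) * (W - 1))), 0), W - 1)
        out += w * feat[t, y, x]
    return out

# Toy example: 4 frames of 8x8 features with 16 channels, K=4 sampling points.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8, 16))
ref = np.array([0.5, 0.5, 0.5])                # clip/frame center
offsets = rng.uniform(-0.1, 0.1, size=(4, 3))  # small learned-style offsets
logits = rng.standard_normal(4)
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over K points
out = st_deformable_attention(ref, feat, offsets, weights)
```

The key design point is that K stays constant as the clip grows, which is where the linear (rather than quadratic) scaling in feature-map size comes from.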

Keywords:
Computer science, Artificial intelligence, Computer vision, Segmentation, Video tracking, Transformer, Benchmark

Metrics

Cited By: 1
FWCI (Field-Weighted Citation Impact): 2.67
References: 19
Citation Normalized Percentile: 0.74



Related Documents

BOOK-CHAPTER

Spatio-Temporal Deformable Attention Network for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

Series: Lecture Notes in Computer Science  Year: 2022  Pages: 581-596

JOURNAL ARTICLE

Patch-Based Spatio-Temporal Deformable Attention BiRNN for Video Deblurring

Huicong Zhang, Haozhe Xie, Shengping Zhang, Hongxun Yao

Journal: IEEE Transactions on Circuits and Systems for Video Technology  Year: 2025  Vol: 35 (6)  Pages: 5545-5559

JOURNAL ARTICLE

DSTA-Net: Deformable Spatio-Temporal Attention Network for Video Inpainting

Tongxing Liu, Guoxin Qiu, Hanyu Xuan

Journal: IEEE Signal Processing Letters  Year: 2024  Vol: 32  Pages: 771-775