JOURNAL ARTICLE

Deformable VisTR: Spatio-Temporal Deformable Attention for Video Instance Segmentation

Sudhir Yarram, Jiong Wu, Pan Ji, Yi Xu, Junsong Yuan

Year: 2022  Published in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  Pages: 3303-3307

Abstract

The video instance segmentation (VIS) task requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR was proposed as an end-to-end transformer-based VIS framework and demonstrated state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points around a reference point. This makes the computation of Deformable VisTR linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR with 10× fewer GPU training hours. We validate the effectiveness of our method on the YouTube-VIS benchmark. Code is available at https://github.com/skrya/DefVIS.
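The core idea behind the abstract's efficiency claim can be sketched: each query attends only to K sampled points around a (t, y, x) reference point in the T×H×W feature volume, so per-query cost is O(K) rather than O(T·H·W). Below is a minimal, illustrative NumPy sketch (not the paper's implementation) using nearest-neighbor sampling; the function name is hypothetical, and the offsets and weights are passed in directly, whereas in the real model they would be predicted from the query by linear layers and sampling would be bilinear:

```python
import numpy as np

def st_deformable_attention(ref_point, feat, offsets, weights):
    """Single-query spatio-temporal deformable attention (illustrative sketch).

    ref_point: (3,) normalized (t, y, x) reference location in [0, 1]
    feat:      (T, H, W, C) spatio-temporal feature volume
    offsets:   (K, 3) normalized sampling offsets around the reference
    weights:   (K,) attention weights over the K sampling points (sum to 1)

    Returns the (C,) weighted sum of features at the K sampled points,
    i.e. O(K) work per query instead of attending to all T*H*W locations.
    """
    T, H, W, C = feat.shape
    out = np.zeros(C, dtype=feat.dtype)
    for (dt, dy, dx), w in zip(offsets, weights):
        # Map normalized (reference + offset) to an integer grid index,
        # clipped to the feature volume (nearest-neighbor for simplicity).
        t = min(max(int(round((ref_point[0] + dt) * (T - 1))), 0), T - 1)
        y = min(max(int(round((ref_point[1] + dy) * (H - 1))), 0), H - 1)
        x = min(max(int(round((ref_point[2] + dx) * (W - 1))), 0), W - 1)
        out += w * feat[t, y, x]
    return out

# Toy example: 4 frames of 8x8 features with 16 channels, K=4 sampling points.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8, 16))
ref = np.array([0.5, 0.5, 0.5])                # clip/frame center
offsets = rng.uniform(-0.1, 0.1, size=(4, 3))  # small learned-style offsets
logits = rng.standard_normal(4)
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over K points
out = st_deformable_attention(ref, feat, offsets, weights)
```

The key design point is that K stays constant as the clip grows, which is where the linear (rather than quadratic) scaling in feature-map size comes from.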

Keywords:
Computer science, Artificial intelligence, Computer vision, Segmentation, Video tracking, Transformer, Benchmark

Metrics

Cited By: 1
FWCI (Field-Weighted Citation Impact): 2.67
References: 19
Citation Normalized Percentile: 0.74



Related Documents

BOOK-CHAPTER

Spatio-Temporal Deformable Attention Network for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

Series: Lecture Notes in Computer Science  Year: 2022  Pages: 581-596

JOURNAL ARTICLE

Patch-Based Spatio-Temporal Deformable Attention BiRNN for Video Deblurring

Huicong Zhang, Haozhe Xie, Shengping Zhang, Hongxun Yao

Journal: IEEE Transactions on Circuits and Systems for Video Technology  Year: 2025  Vol: 35 (6)  Pages: 5545-5559

JOURNAL ARTICLE

DSTA-Net: Deformable Spatio-Temporal Attention Network for Video Inpainting

Tongxing Liu, Guoxin Qiu, Hanyu Xuan

Journal: IEEE Signal Processing Letters  Year: 2024  Vol: 32  Pages: 771-775