MAGVIT: Masked Generative Video Transformer

Lijun Yu; Yong Cheng; Kihyuk Sohn; José Lezama; Han Zhang; Hui-Wen Chang; Alexander G. Hauptmann; Ming-Hsuan Yang; Hao Yuan; Irfan Essa; Lu Jiang

doi:10.1109/cvpr52729.2023.01008

JOURNAL ARTICLE

Year: 2023 Pages: 10459-10469

Abstract

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
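As a rough illustration of the two ideas named in the abstract (a video quantized into a spatial-temporal grid of discrete tokens, and a subset of those tokens hidden for masked modeling), the following is a minimal sketch only. The function names, strides, codebook size, mask ratio, and mask id are all hypothetical choices made for this example; none of them are taken from the paper, and the random token ids stand in for the output of a real tokenizer.

```python
import random

# All constants below are hypothetical and chosen only for illustration.
CODEBOOK_SIZE = 1024          # assumed number of discrete codes
MASK_ID = CODEBOOK_SIZE       # assumed special id outside the codebook range


def token_grid_shape(frames, height, width, t_stride, s_stride):
    """Token grid shape for a video downsampled in time and space."""
    return (frames // t_stride, height // s_stride, width // s_stride)


def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of token ids with mask_id.

    Returns the masked sequence and the set of masked positions.
    """
    n_mask = round(len(tokens) * mask_ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_id if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions


rng = random.Random(0)

# A 16-frame 128x128 clip with assumed strides of 4 (time) and 8 (space).
shape = token_grid_shape(frames=16, height=128, width=128, t_stride=4, s_stride=8)
num_tokens = shape[0] * shape[1] * shape[2]

# Stand-in for tokenizer output: random code ids in flattened grid order.
tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(num_tokens)]

# Hide half of the tokens; a model would be trained to predict them.
masked, positions = mask_tokens(tokens, mask_ratio=0.5, mask_id=MASK_ID, rng=rng)

print(shape, num_tokens, len(positions))
```

Under these assumed settings the clip becomes a 4 x 16 x 16 grid, so a model would operate on 1024 token positions rather than on raw pixels.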

Keywords:

Computer science; Security token; Transformer; Embedding; Inference; Generative model; Autoregressive model; Artificial intelligence; Generative grammar; Flexibility (engineering); Source code; Speech recognition; Machine learning; Pattern recognition (psychology); Programming language

Metrics

FWCI (Field Weighted Citation Impact): 15.10

Refs: 115

Citation Normalized Percentile: 0.99 (in top 1%)

Topics

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Vision and Imaging

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

MaskGIT: Masked Generative Image Transformer

MAGVLT: Masked Generative Vision-and-Language Transformer

Accelerated masked transformer for dense video captioning

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Video Anomaly Detection Based on Random Masked Transformer