VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Yabo Zhang; Yuxiang Wei; Xianhui Lin; Zheng Hui; Peiran Ren; Xuansong Xie; Wangmeng Zuo

doi:10.1609/aaai.v39i10.33114


JOURNAL ARTICLE


Year: 2025; Journal: Proceedings of the AAAI Conference on Artificial Intelligence; Vol: 39 (10); Pages: 10266-10274; Publisher: Association for the Advancement of Artificial Intelligence


Abstract

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Unlike conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses an encapsulated T2V to enhance temporal consistency and then inverts the result to the noise distribution required by T2I. Spatial quality elevating then harnesses an inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on an extensive set of prompts across combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Please see the videos in the supplementary materials for a better view.
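The two-stage decomposition described in the abstract can be illustrated with a toy sketch of the control flow. Everything here is an assumption made for illustration: the function names are hypothetical, neighbour averaging stands in for the T2V model, the standard forward-noising formula stands in for the inversion step, and a pull toward the frame mean stands in for the inflated T2I denoiser. It is not the authors' implementation and uses no real diffusion model.

```python
import math
import random

# Toy latents: a video is a list of frames, each frame a list of floats.
# All three stage functions are hypothetical stand-ins, not the paper's models.

def temporal_motion_refine(frames, strength=0.5):
    """Stand-in for the T2V stage: blend each frame with its neighbours."""
    n = len(frames)
    refined = []
    for i, frame in enumerate(frames):
        prev_f = frames[max(i - 1, 0)]
        next_f = frames[min(i + 1, n - 1)]
        refined.append([
            (1 - strength) * x + strength * 0.5 * (p + q)
            for x, p, q in zip(frame, prev_f, next_f)
        ])
    return refined

def invert_to_noise_level(frames, alpha_bar, rng):
    """Stand-in for inversion: x_t = sqrt(a) * x + sqrt(1 - a) * eps."""
    scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [
        [scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in frame]
        for frame in frames
    ]

def spatial_quality_elevate(frames, rate=0.5):
    """Stand-in for the T2I stage: pull each value toward its frame mean."""
    elevated = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        elevated.append([x + rate * (mean - x) for x in frame])
    return elevated

def sample(frames, alpha_bars, seed=0):
    """Run the temporal stage, then the spatial stage, at every step."""
    rng = random.Random(seed)
    for alpha_bar in alpha_bars:
        frames = temporal_motion_refine(frames)
        frames = invert_to_noise_level(frames, alpha_bar, rng)
        frames = spatial_quality_elevate(frames)
    return frames

if __name__ == "__main__":
    init_rng = random.Random(1)
    video = [[init_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
    result = sample(video, alpha_bars=[0.5, 0.8, 0.99])
    print(len(result), len(result[0]))
```

The point of the sketch is only the ordering inside `sample`: each step first enforces consistency across frames, then moves the latents to the noise level the per-frame model expects, and only then applies the per-frame refinement.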

Keywords: Computer science; Diffusion; Quality (philosophy); Image (mathematics); Image quality; Computer vision; Computer graphics (images); Artificial intelligence; Physics

Metrics

FWCI (Field Weighted Citation Impact): 7.34

Citation Normalized Percentile: 0.91

Topics

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimedia Communication and Technology

Social Sciences → Social Sciences → Sociology and Political Science

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Conditional Text Image Generation with Diffusion Models

Grid Diffusion Models for Text-to-Video Generation

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

ColorDiffuser: Video Colorization with Pretrained Text-to-Image Diffusion Models