Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Haomiao Ni; Changhao Shi; Kai Li; Xiaolei Huang; Martin Renqiang Min

doi:10.1109/cvpr52729.2023.01769

Year: 2023

Pages: 18444-18455

Abstract

Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image (e.g., a person's face) and a condition (e.g., an action class label like smile). The key challenge of the cI2V task lies in the simultaneous generation of realistic spatial appearance and temporal dynamics corresponding to the given image and condition. In this paper, we propose an approach for cI2V using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space based on the given condition to warp the given image. Compared to previous direct-synthesis-based works, our proposed LFDM can better synthesize spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally-coherent flow. The training of LFDM consists of two separate stages: (1) an unsupervised learning stage to train a latent flow auto-encoder for spatial content generation, including a flow predictor to estimate latent flow between pairs of video frames, and (2) a conditional learning stage to train a 3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike previous DMs that operate in pixel space or in a latent feature space that couples spatial and temporal information, the DM in our LFDM only needs to learn a low-dimensional latent flow space for motion generation, and is thus more computationally efficient. We conduct comprehensive experiments on multiple datasets, where LFDM consistently outperforms prior art. Furthermore, we show that LFDM can be easily adapted to new domains by simply fine-tuning the image decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
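
The central operation in the abstract, warping the latent representation of the given image with a generated flow field, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). It assumes a single-channel latent map stored as nested lists, backward warping with bilinear interpolation, and border clamping; the abstract specifies none of these details, and the function names warp_latent and warp_sequence are hypothetical.

```python
import math


def warp_latent(latent, flow):
    """Backward-warp a 2D latent map with a per-position flow field.

    latent: H x W list of floats (one channel of a hypothetical latent map).
    flow:   H x W list of (dx, dy) offsets; output[y][x] samples the input
            at (x + dx, y + dy) using bilinear interpolation.
    Both the layout and the sampling rule are assumptions for illustration.
    """
    h, w = len(latent), len(latent[0])

    def sample(x, y):
        # Clamp to the border so out-of-range lookups reuse edge values.
        x = min(max(x, 0.0), w - 1.0)
        y = min(max(y, 0.0), h - 1.0)
        x0, y0 = int(math.floor(x)), int(math.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        ax, ay = x - x0, y - y0
        # Interpolate along x on the two neighbouring rows, then along y.
        top = (1 - ax) * latent[y0][x0] + ax * latent[y0][x1]
        bottom = (1 - ax) * latent[y1][x0] + ax * latent[y1][x1]
        return (1 - ay) * top + ay * bottom

    return [
        [sample(x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]


def warp_sequence(latent, flows):
    """Warp the same reference latent once per flow field (one per frame)."""
    return [warp_latent(latent, f) for f in flows]
```

In this sketch every output frame is derived from the same reference latent, which mirrors the abstract's point that the spatial content comes from the given image while the flow sequence carries the motion. A zero flow field returns the latent unchanged, and a constant flow of (1, 0) shifts the content one position to the left, repeating the right edge.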

Keywords:

Computer science; Artificial intelligence; Optical flow; Image warping; Feature vector; Feature (linguistics); Computer vision; Flow (mathematics); Pattern recognition (psychology); Image (mathematics); Mathematics

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 16.74

Refs: 128

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%

Topics

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Vision and Imaging

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

LaMD: Latent Motion Diffusion for Image-Conditional Video Generation

Conditional Text Image Generation with Diffusion Models

Conditional Latent Diffusion for Precision-Controllable Image Generation

Latent Flow Diffusion for Deepfake Video Generation