Abstract

There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
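The core idea stated in the abstract (conditioning a diffusion model on masked input so that it denoises only the masked patches) can be illustrated with a minimal sketch. This is not the authors' implementation: the shapes, the cosine noise schedule, and the placeholder `denoiser` function are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# An image flattened into patch tokens (shapes are illustrative).
num_patches, patch_dim = 16, 8
patches = rng.normal(size=(num_patches, patch_dim))

# Random 75% masking, as is common for masked autoencoders.
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

# Diffusion forward process q(x_t | x_0) applied only to the masked patches,
# using an illustrative cosine schedule for alpha_bar at a random timestep t.
T = 1000
t = int(rng.integers(1, T))
alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2
noise = rng.normal(size=(num_masked, patch_dim))
x0 = patches[masked_idx]
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise


def denoiser(noisy_masked, visible):
    # Placeholder (hypothetical) for the encoder-decoder: a real model would
    # attend from the noisy masked tokens to the visible tokens; here we just
    # return the noisy input so the example runs end to end.
    return noisy_masked


# Training target: regress the clean masked patches given the visible ones.
pred = denoiser(xt, patches[visible_idx])
loss = float(np.mean((pred - x0) ** 2))
```

The key departure from plain diffusion pre-training, per the abstract, is the conditioning: the model never denoises the whole image, only the masked region, given the clean visible patches.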

Keywords:
Initialization, Computer science, Inpainting, Artificial intelligence, Diffusion, Noise reduction, Line (geometry), Noise (video), Computer vision, Pattern recognition (psychology), Image (mathematics), Machine learning, Mathematics

Metrics

- Cited By: 27
- FWCI (Field Weighted Citation Impact): 4.91
- Refs: 106
- Citation Normalized Percentile: 0.95 (in the top 1% and top 10%)

Topics

- Generative Adversarial Networks and Image Synthesis (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
- Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
- Aesthetic Perception and Analysis (Life Sciences → Neuroscience → Cognitive Neuroscience)