Understanding Masked Autoencoders via Hierarchical Latent Variable Models

Lingjing Kong; Quintín Martín Martín; Guangyi Chen; Eric P. Xing; Yuejie Chi; Louis–Philippe Morency; Kun Zhang

doi:10.1109/cvpr52729.2023.00765

Year: 2023 Pages: 7918-7928

Abstract

Masked autoencoder (MAE), a simple and effective self-supervised learning framework based on the reconstruction of masked image regions, has recently achieved prominent success in a variety of vision tasks. Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking. In this work, we formally characterize and justify existing empirical insights and provide theoretical guarantees of MAE. We formulate the underlying data-generating process as a hierarchical latent variable model, and show that under reasonable assumptions, MAE provably identifies a set of latent variables in the hierarchical model, explaining why MAE can extract high-level information from pixels. Further, we show how key hyperparameters in MAE (the masking ratio and the patch size) determine which true latent variables are recovered, thereby influencing the level of semantic information in the representation. Specifically, extremely large or small masking ratios inevitably lead to low-level representations. Our theory offers coherent explanations of existing empirical observations and provides insights into potential empirical improvements and fundamental limitations of the masked-reconstruction paradigm. We conduct extensive experiments to validate our theoretical insights.
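To make the two hyperparameters named in the abstract concrete, the following is a minimal sketch of random patch masking, not the authors' implementation. The helper name `random_patch_mask`, the uniform random sampling, and the example values (a 224-pixel square image, 16-pixel patches, a 0.75 masking ratio) are illustrative assumptions that do not appear in the abstract.

```python
import random


def random_patch_mask(image_size, patch_size, mask_ratio, seed=0):
    """Split the patch grid of a square image into visible and masked indices.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")

    # The patch size fixes how many patches the image is divided into.
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side

    # The masking ratio fixes how many of those patches are hidden.
    num_masked = int(num_patches * mask_ratio)

    # Assumption: patches are masked uniformly at random.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Illustrative values only (assumed, not stated in the abstract).
visible, masked = random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75)
print(len(visible), len(masked))  # prints "49 147"
```

With these assumed values the image splits into 14 × 14 = 196 patches, of which 147 are hidden and 49 remain visible to the encoder. Changing `patch_size` or `mask_ratio` changes how much of the image must be inferred from the rest, which is the quantity the abstract ties to the semantic level of the learned representation.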

Keywords:

Latent variable; Computer science; Artificial intelligence; Masking (illustration); Machine learning; Hyperparameter; Representation (politics); Latent variable model; Variety (cybernetics); Autoencoder; Variable (mathematics); Probabilistic latent semantic analysis; Key (lock); Set (abstract data type); Mathematics; Deep learning

Metrics

FWCI (Field Weighted Citation Impact): 3.64
Refs: 111
Citation Normalized Percentile: 0.92

Topics

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Digital Media Forensic Detection

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
