Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pre-training architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pre-training scheme for multiple unimodal downstream tasks using a single audiovisual pre-trained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on EPIC-Kitchens without pre-training specifically for this dataset.
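The core idea of masked autoencoding over two modalities can be illustrated with a minimal sketch: patch tokens from both the video frames and the audio spectrogram are masked independently, only the visible tokens are fed to the encoder, and a decoder is trained to reconstruct the masked patches. All shapes, token counts, and the masking ratio below are illustrative assumptions, not the architecture studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask: True = token is hidden from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

# Hypothetical token counts: 196 video patches and 64 audio-spectrogram
# patches, each embedded into a 32-dimensional vector.
video_tokens = rng.normal(size=(196, 32))
audio_tokens = rng.normal(size=(64, 32))

# Mask each modality independently at a 75% ratio (an assumed value),
# then keep only the visible tokens as encoder input.
video_mask = random_mask(196, 0.75, rng)
audio_mask = random_mask(64, 0.75, rng)
visible = np.concatenate([video_tokens[~video_mask],
                          audio_tokens[~audio_mask]])

# A decoder would receive `visible` plus learned mask tokens and be trained
# to reconstruct the masked pixel / spectrogram patches (e.g. with MSE loss).
```

Because 75% of each modality is dropped, the encoder only processes a quarter of the tokens, which is what makes masked-autoencoder pre-training computationally attractive.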