Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pre-training architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pre-training scheme for multiple unimodal downstream tasks using a single audiovisual pre-trained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on EPIC-Kitchens without pre-training specifically for this dataset.
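The core idea of masked autoencoding over two modalities can be illustrated with a minimal sketch: patch tokens from both the video frames and the audio spectrogram are masked independently, only the visible tokens are fed to the encoder, and a decoder is trained to reconstruct the masked patches. All shapes, token counts, and the masking ratio below are illustrative assumptions, not the architecture studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask: True = token is hidden from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

# Hypothetical token counts: 196 video patches and 64 audio-spectrogram
# patches, each embedded into a 32-dimensional vector.
video_tokens = rng.normal(size=(196, 32))
audio_tokens = rng.normal(size=(64, 32))

# Mask each modality independently at a 75% ratio (an assumed value),
# then keep only the visible tokens as encoder input.
video_mask = random_mask(196, 0.75, rng)
audio_mask = random_mask(64, 0.75, rng)
visible = np.concatenate([video_tokens[~video_mask],
                          audio_tokens[~audio_mask]])

# A decoder would receive `visible` plus learned mask tokens and be trained
# to reconstruct the masked pixel / spectrogram patches (e.g. with MSE loss).
```

Because 75% of each modality is dropped, the encoder only processes a quarter of the tokens, which is what makes masked-autoencoder pre-training computationally attractive.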