Vision Transformers are Parameter-Efficient Audio-Visual Learners

Yan-Bo Lin; Yi-Lin Sung; Jie Lei; Mohit Bansal; Gedas Bertasius

doi:10.1109/cvpr52729.2023.00228

JOURNAL ARTICLE

Year: 2023 Pages: 2299-2309

Abstract

Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thereby eliminating the quadratic cost of standard cross-attention. Compared to existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
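
The latent-token bottleneck described above can be illustrated with a minimal sketch. It assumes a two-step scheme (a few latent tokens first attend to the audio tokens, then the visual tokens attend to the resulting summary), omits all learned projections, and uses hypothetical names (`cross_attention`, `latent_bottleneck_fusion`) and toy token counts. It is an illustration under these assumptions, not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head dot-product cross-attention without learned projections.

    Returns the attended outputs and the number of query-key scores computed,
    which serves here as a rough proxy for attention cost.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs, len(queries) * len(keys_values)


def latent_bottleneck_fusion(visual, audio, latents):
    """Assumed two-step fusion: compress audio into latents, then fuse into visual."""
    # Step 1: each latent token summarizes the audio tokens.
    summary, cost_compress = cross_attention(latents, audio)
    # Step 2: each visual token attends only to the small latent summary.
    fused, cost_fuse = cross_attention(visual, summary)
    return fused, cost_compress + cost_fuse


def random_tokens(n, dim):
    """Toy stand-in for token embeddings (hypothetical data)."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
visual = random_tokens(12, 4)   # 12 visual tokens (toy size)
audio = random_tokens(10, 4)    # 10 audio tokens (toy size)
latents = random_tokens(2, 4)   # 2 latent tokens forming the bottleneck

_, direct_cost = cross_attention(visual, audio)
fused, bottleneck_cost = latent_bottleneck_fusion(visual, audio, latents)
print(direct_cost, bottleneck_cost)  # prints: 120 44
```

With N_v visual tokens, N_a audio tokens, and L latent tokens, direct cross-attention computes N_v * N_a scores, whereas this bottleneck computes L * (N_v + N_a). For a fixed, small L the cost therefore grows linearly rather than quadratically with the number of tokens.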

Keywords:

Computer science; Audio visual; Adapter (computing); Transformer; Encoder; Artificial intelligence; Visualization; Speech recognition; Computer vision; Multimedia; Computer hardware

Metrics

Cited By

19.06

FWCI (Field Weighted Citation Impact)

135

Refs

1.00

Citation Normalized Percentile

Is in top 1%

Is in top 10%

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Hearing Loss and Rehabilitation

Life Sciences → Neuroscience → Cognitive Neuroscience

Related Documents

Siamese Vision Transformers are Scalable Audio-Visual Learners

Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Parameter-Efficient Model Adaptation for Vision Transformers

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers