Multi-modal Deepfake Detection via Multi-task Audio-Visual Prompt Learning

Hui Miao; Yuanfang Guo; Zeming Liu; Yunhong Wang

doi:10.1609/aaai.v39i1.32042

JOURNAL ARTICLE

Year: 2025
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Vol: 39 (1)
Pages: 612-621
Publisher: Association for the Advancement of Artificial Intelligence

Abstract

With the malicious use and dissemination of multi-modal deepfake videos, researchers have started to investigate multi-modal deepfake detection. Unfortunately, most existing methods tune all the parameters of the deep network on limited speech video datasets and are trained under coarse-grained consistency supervision, which hinders their generalization ability in practical scenarios. To solve these problems, we propose the first multi-task audio-visual prompt learning method for multi-modal deepfake video detection, which exploits multiple foundation models. Specifically, we construct a two-stream multi-task learning architecture and propose sequential visual prompts and short-time audio prompts to extract multi-modal features, which are aligned at the frame level and used in subsequent fine-grained feature matching and fusion. Because visual content and audio signals are naturally aligned in real data, we propose a frame-level cross-modal feature matching loss function to learn fine-grained audio-visual consistency. Comprehensive experiments demonstrate the effectiveness and superior generalization ability of our method compared with state-of-the-art methods.
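
The abstract does not give the form of the frame-level cross-modal feature matching loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes per-frame visual and audio features that are already aligned by index, uses cosine similarity as the matching score, and treats real videos as pairs to pull together and fake videos as pairs whose similarity above a margin is penalized. The names `cosine`, `frame_matching_loss`, and `margin` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def frame_matching_loss(visual, audio, is_real, margin=0.0):
    """Hypothetical frame-level matching loss (an assumption, not the paper's formula).

    visual, audio: lists of per-frame feature vectors, aligned by index.
    is_real: True pulls each frame pair together (loss = 1 - similarity);
             False penalizes similarity that exceeds `margin`.
    """
    if not visual or len(visual) != len(audio):
        raise ValueError("streams must be non-empty and have the same number of frames")
    per_frame = []
    for v, a in zip(visual, audio):
        sim = cosine(v, a)
        per_frame.append(1.0 - sim if is_real else max(0.0, sim - margin))
    # Average over frames so the loss does not depend on clip length.
    return sum(per_frame) / len(per_frame)


# Toy example: two frames with 2-D features.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[1.0, 0.0], [0.0, 1.0]]   # audio agrees with each frame
audio_shifted = [[0.0, 1.0], [1.0, 0.0]]   # audio is off by one frame

loss_matched = frame_matching_loss(visual, audio_matched, is_real=True)
loss_shifted = frame_matching_loss(visual, audio_shifted, is_real=True)
```

In this toy case the matched streams give a loss near zero and the shifted streams give a loss near one, which is the behaviour a frame-level consistency objective is meant to reward for real data. A clip-level score would average the features first and could miss such a one-frame misalignment.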

Keywords:

Computer science; Task (project management); Modal; Audio visual; Artificial intelligence; Speech recognition; Natural language processing; Human–computer interaction; Multimedia; Engineering; Chemistry

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.36

Topics

Digital Media Forensic Detection

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Anomaly Detection Techniques and Applications

Physical Sciences → Computer Science → Artificial Intelligence

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Frequency-Enhanced Multi-Modal Consistency Learning for Audio-Visual Deepfake Detection

Vision-Audio based Deepfake Detection according to Multi-modal Learning

Multi-Modal Deepfake Detection System Using Visual, Audio, and Temporal Cues

DF-CLIP: Adapting Visual-Language Models for Generalizable Deepfake Detection Using Multi-Modal Prompt Tuning

Visual Prompt Multi-Modal Tracking