With the rapid development of facial tampering techniques, the deepfake detection task has attracted widespread social concern. Most existing video-based methods adopt temporal convolution to learn temporal discontinuities directly, and may therefore neglect both local detail mutations and inconsistent global expression semantics along the temporal dimension. This makes it difficult to learn more discriminative forgery cues. To mitigate this issue, we introduce a novel deepfake video detection framework specifically designed to capture fine-grained traces of tampering. Concretely, we first present a Multi-layered Feature Extraction module (MFE) that constructs comprehensive spatio-temporal representations by stitching together features from different levels. Afterward, we propose a Bidirectional temporal Artifact Enhancement module (BAE), which exploits local differences between adjacent frames to enhance frame-level features. Moreover, we present a Cross temporal Stride Aggregation strategy (CSA) to mine inconsistent global semantics and adaptively obtain multi-timescale representations. Extensive experiments on several benchmarks demonstrate that the proposed method achieves state-of-the-art performance, outperforming other competitive approaches.
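The bidirectional artifact-enhancement idea described above can be illustrated with a minimal sketch: compute forward and backward differences between adjacent frame features and use their magnitudes to emphasize frames whose features change abruptly. This is a hypothetical simplification for intuition, not the paper's exact BAE formulation (the function name, the 0.5 weighting, and the absolute-difference enhancement are all assumptions).

```python
import numpy as np

def bidirectional_artifact_enhancement(feats):
    """Enhance per-frame features with local temporal differences.

    feats: array of shape (T, C) -- one feature vector per video frame.
    Returns an array of the same shape. Illustrative sketch only; the
    paper's BAE module is not specified at this level of detail.
    """
    fwd = np.zeros_like(feats)          # difference toward the next frame
    bwd = np.zeros_like(feats)          # difference from the previous frame
    fwd[:-1] = feats[1:] - feats[:-1]
    bwd[1:] = feats[1:] - feats[:-1]
    # Emphasize frames whose features mutate abruptly in either direction;
    # a temporally constant clip is left unchanged.
    return feats + 0.5 * (np.abs(fwd) + np.abs(bwd))

frames = np.random.rand(8, 16).astype(np.float32)
enhanced = bidirectional_artifact_enhancement(frames)
print(enhanced.shape)  # (8, 16)
```

A deepfake often introduces frame-local detail mutations, so amplifying adjacent-frame differences in this way makes such discontinuities more salient to a downstream classifier.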
Staffy Kingra, Naveen Aggarwal, Nirmal Kaur
Yingqi Gao, Zhiling Luo, Shiqian Chen, Wei Zhou
Ruyi Chang, Haopeng Wang, Zhitian Zhang, Dejiao Huang, Shuai Guo
Jorge Pessoa, Helena Aidos, Pedro Tomás, Mário A. T. Figueiredo
Jianping Li, Jing Sun, Yanyi Meng, Kang Xu