HolisticDFD: Infusing Spatiotemporal Transformer Embeddings for Deepfake Detection

Muhammad Anas Raza, Khalid Mahmood

Deepfakes, synthetic audiovisual media created with the intent to deceive, are increasingly prevalent. Existing methods, which process inputs either independently as images or patches or jointly as tubelets, have typically focused on spatial or spatiotemporal inconsistencies. However, the evolving nature of deepfakes demands a holistic approach. Validating the authenticity of a given multimedia sample without adding significant computational overhead has not yet been fully explored in the literature. In addition, no prior work has examined the impact of different inconsistency dimensions within a single framework. This paper tackles the deepfake detection problem holistically. HolisticDFD, a novel transformer-based deepfake detection method that is both lightweight and compact, combines embeddings from the spatial, temporal, and spatiotemporal dimensions to separate deepfakes from bona fide videos. The proposed system achieves 0.926 AUC on the DFDC dataset using just 3% of the parameters used by state-of-the-art detectors. Evaluation on other datasets demonstrates the efficacy of the proposed framework, and an ablation study shows that performance improves gradually as embeddings from different data representations are combined. An implementation of the proposed model is available at: https://github.com/smileslab/deepfake-detection/.
