Why Deepfake Detection Accuracy Numbers Are Misleading (And What a Fair Benchmark Looks Like)

Abstract

Most deepfake detectors look impressive on paper until you test them on video from a dataset they were not trained on. We trained five architectures (EfficientNet-B4, Xception, ResNet-50, ConvNeXt-Tiny, ViT-Base) across four experiments, three of them cross-dataset, with zero identity leakage between train and test. The results are humbling: every model drops to near-chance AUC (0.51 to 0.54) when trained on Celeb-DF and tested on FaceForensics++, regardless of architecture. The strongest backbone is ConvNeXt-Tiny (AUC 0.915 on a fully held-out test set), not the Vision Transformer with roughly 3× as many parameters. The real lesson: in these experiments, training-set diversity mattered more than model size.

Introduction

A deepfake detector with 95% accuracy is not impressive. It is suspicious.

That number almost certainly comes from evaluating on the same dataset it trained on — a practice so common in deepfake research that it has become the unexamined default. The model learns the compression artifacts, lighting conditions, and GAN fingerprints specific to one dataset. It aces the test. Then it fails completely on real-world video.

We stopped asking "which model is most accurate?" and started asking "which model still works when the video comes from somewhere it has never seen?" Four experiments, five architectures, three datasets, zero identity leakage. The results reframe what it means to build a useful deepfake detector.

Related work

Deepfake detection benchmarks. FaceForensics++ (Rössler et al., 2019) introduced a large-scale benchmark that now covers five manipulation methods and established the convention of evaluating on held-out videos from the same dataset. Later datasets, including Celeb-DF (Li et al., 2020) and DFDC (Dolhansky et al., 2020), showed that detectors trained on FF++ lose much of their accuracy on these newer test sets, motivating cross-dataset evaluation.

CNN-based detectors. Xception (Chollet, 2017) was adopted as the de facto baseline for deepfake detection by Rössler et al. EfficientNet variants have since demonstrated competitive performance with fewer parameters. ResNet-50 remains a strong classical baseline. Recent work on ConvNeXt (Liu et al., 2022) shows that purely convolutional architectures with modern design choices can match or exceed transformer performance on vision tasks with limited data.

Transformer-based detectors. Vision Transformers (Dosovitskiy et al., 2021) have been applied to deepfake detection with mixed results. Zhao et al. (2021) and several subsequent works find that ViT variants require large pretraining corpora to learn effective spatial priors for face manipulation artifacts. Under modest data regimes, convolutional inductive biases remain advantageous.

Generalization studies. Tolosana et al. (2020) and Li et al. (2020) study generalization in deepfake detection, and both report severe cross-dataset drops. Our work differs by (a) enforcing strict identity-disjoint splits across all experiments, (b) using one identical training recipe across five architectures, and (c) including a held-out dataset (DFD) not used in any training run.

Robustness. Dzanic et al. (2020) and Haliassos et al. (2021) study robustness to video compression and social media processing. We extend this to a systematic four-corruption study (JPEG, blur, noise, downscaling) across all backbone/experiment combinations.

Methodology

Face extraction. MTCNN is used to detect faces in each video frame. Detected bounding boxes are expanded by a 20% margin and cropped to 224×224. Frames are sampled at a fixed stride to balance coverage and compute. Crops are stored as JPEG files indexed by video identity and split assignment.
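The box expansion and frame sampling described above can be sketched as follows. This is a minimal illustration, not the released pipeline: it assumes the 20% margin is the total growth in width and height (10% per side), that boxes are clamped to the image bounds, and that the stride is a free parameter (the text does not state its value). The function names are hypothetical.

```python
def expand_box(box, img_w, img_h, margin=0.20):
    """Grow an (x1, y1, x2, y2) face box by `margin` of its size and clamp it.

    Assumption: the margin is split evenly, so each side moves outward by
    margin / 2 of the box width or height.
    """
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * margin / 2
    pad_y = (y2 - y1) * margin / 2
    return (
        max(0, round(x1 - pad_x)),
        max(0, round(y1 - pad_y)),
        min(img_w, round(x2 + pad_x)),
        min(img_h, round(y2 + pad_y)),
    )


def sampled_frame_indices(n_frames, stride):
    """Indices of the frames kept when sampling every `stride`-th frame."""
    return list(range(0, n_frames, stride))
```

The expanded box would then be cropped and resized to 224×224 with whatever image library the pipeline uses.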

Identity-disjoint splits. For Exp 1 (FF++ in-distribution), we use the official FaceForensics++ identity partition of 720/140/140 train/val/test identities. A fake clip named Method_Target_Source is assigned to a split only if both the target and source identities belong to that split, guaranteeing zero identity leakage. For cross-dataset experiments (Exp 2, 3, 4), identity disjointness holds by construction since datasets contain different individuals.
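The assignment rule for fake clips can be written in a few lines. The sketch below assumes clip names follow the Method_Target_Source pattern described above and that a dictionary mapping each identity to its split is already available; the function name and the example identities are hypothetical.

```python
def assign_split(clip_name, split_of_identity):
    """Return the split of a fake clip, or None if the clip must be dropped.

    A clip is kept only when its target and source identities both belong
    to the same split, so no identity appears in two splits.
    """
    # rsplit keeps the method name intact even if it contained an underscore.
    _method, target, source = clip_name.rsplit("_", 2)
    target_split = split_of_identity.get(target)
    source_split = split_of_identity.get(source)
    if target_split is not None and target_split == source_split:
        return target_split
    return None


# Hypothetical identity-to-split mapping for illustration.
splits = {"000": "train", "003": "train", "010": "test"}
kept = assign_split("Deepfakes_000_003", splits)      # both identities in train
dropped = assign_split("Deepfakes_000_010", splits)   # identities straddle splits
```

Clips that straddle two splits are discarded rather than assigned to either side, which costs some data but removes the leakage path.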

Backbone architectures. Five models are loaded from the timm library with ImageNet pretrained weights: EfficientNet-B4 (efficientnet_b4), Xception (xception), ResNet-50 (resnet50), ConvNeXt-Tiny (convnext_tiny), and ViT-Base (vit_base_patch16_224). The final classification head of each model is replaced with a binary linear layer.

Training recipe. All five backbones share an identical training configuration: AdamW optimizer (lr = 1e-4, weight decay = 1e-4), cosine annealing LR schedule, progressive unfreezing (one stage unfrozen every 3 epochs), label smoothing (ε = 0.05), AMP mixed-precision (fp16), gradient accumulation (×4, effective batch size 64), WeightedRandomSampler for class balance, and early stopping on validation AUC with patience = 5. Training augmentation includes horizontal flip, random crop, color jitter, and Gaussian blur.
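Two pieces of this recipe, the cosine learning-rate curve and the progressive-unfreezing timetable, can be sketched without a deep learning framework. Several values here are illustrative assumptions not stated in the text: the total epoch count, a minimum learning rate of 0, the number of backbone stages, and the choice that the head trains from epoch 0 with the first backbone stage unfrozen at epoch 3. The micro-batch of 16 follows from ×4 accumulation and an effective batch of 64.

```python
import math

BASE_LR = 1e-4
ACCUM_STEPS = 4
EFFECTIVE_BATCH = 64
MICRO_BATCH = EFFECTIVE_BATCH // ACCUM_STEPS  # 16 samples per forward pass


def cosine_lr(epoch, total_epochs, base_lr=BASE_LR, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 down to min_lr at total_epochs."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


def unfrozen_stages(epoch, n_stages, every=3):
    """Number of backbone stages trainable at `epoch` (assumed timetable)."""
    return min(n_stages, epoch // every)


# Hypothetical 30-epoch run with a 4-stage backbone.
schedule = [(e, cosine_lr(e, 30), unfrozen_stages(e, 4)) for e in range(30)]
```

In a real training loop the optimizer would step once every ACCUM_STEPS micro-batches, with the loss divided by ACCUM_STEPS so the accumulated gradient matches a single batch of 64.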

Evaluation. At inference, 5 test-time augmentation views are averaged per crop (mean pooling). Video-level predictions are the mean of all face-crop scores for that video. The primary metric is video-level AUC. 95% confidence intervals are computed by bootstrap resampling (1,000 iterations) over videos, not individual crops.
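The aggregation and the video-level bootstrap can be sketched in pure Python. This is an illustration under assumptions rather than the released evaluation script: crop scores are taken to be already averaged over the test-time augmentation views, the confidence interval uses the percentile method, and resamples that contain only one class are redrawn. All function names are hypothetical.

```python
import random
from collections import defaultdict


def video_level_scores(crop_scores):
    """Average (video_id, score) pairs into one mean score per video."""
    per_video = defaultdict(list)
    for video_id, score in crop_scores:
        per_video[video_id].append(score)
    return {vid: sum(s) / len(s) for vid, s in per_video.items()}


def auc(labels, scores):
    """Probability that a fake (label 1) outscores a real (label 0); ties count 0.5.

    Quadratic in the number of videos, which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one video of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling whole videos with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap AUC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # redraw degenerate resamples (an assumption)
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling videos rather than crops matters because crops from the same video are strongly correlated; a crop-level bootstrap would produce intervals that are too narrow.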

Robustness study. Trained checkpoints are evaluated without retraining on face crops corrupted by: JPEG compression (quality 10, 30, 50), Gaussian blur (σ = 1, 3, 5), Gaussian noise (σ = 10, 25, 40), and downscaling (×0.5, ×0.25, ×0.125 then upscaled). AUC is reported for each corruption level.
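The corruption grid and two of the four corruptions can be illustrated on a plain grayscale pixel grid. This sketch assumes the noise σ values are on a 0 to 255 pixel scale and uses nearest-neighbour resampling for the downscale and upscale step; the text specifies neither, and a real pipeline would apply an image library to RGB crops. The names are hypothetical.

```python
import random

# Corruption levels as listed in the text.
CORRUPTION_LEVELS = {
    "jpeg_quality": [10, 30, 50],
    "blur_sigma": [1, 3, 5],
    "noise_sigma": [10, 25, 40],
    "downscale_factor": [0.5, 0.25, 0.125],
}


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise to a 2D list of 0-255 values, then clip."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in pixels
    ]


def downscale_then_upscale(pixels, factor):
    """Shrink by `factor`, then restore the original size (nearest neighbour)."""
    h, w = len(pixels), len(pixels[0])
    small_h, small_w = max(1, int(h * factor)), max(1, int(w * factor))
    small = [
        [pixels[int(i * h / small_h)][int(j * w / small_w)] for j in range(small_w)]
        for i in range(small_h)
    ]
    return [
        [small[int(i * small_h / h)][int(j * small_w / w)] for j in range(w)]
        for i in range(h)
    ]


# Every (corruption, level) pair evaluated per checkpoint: 4 types x 3 levels.
settings = [(name, lvl) for name, lvls in CORRUPTION_LEVELS.items() for lvl in lvls]
```

Because checkpoints are evaluated without retraining, each of the 12 settings only requires a forward pass over the corrupted test crops.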

Experiments

We design four experiments, one in-distribution and three cross-dataset:

Exp 1 — In-distribution (FF++ → FF++).
Train and test on FaceForensics++ using the official identity partition. This provides an in-distribution reference point, where the test distribution matches training. Five manipulation methods are included: Deepfakes, Face2Face, FaceSwap, NeuralTextures, and FaceShifter.

Exp 2 — Cross-dataset (FF++ → Celeb-DF v2).
Train on FaceForensics++ (all five methods), test on the full Celeb-DF v2 test set. Celeb-DF contains higher-quality fakes using a different synthesis pipeline, testing whether FF++-trained models generalize beyond their training manipulation signatures.

Exp 3 — Cross-dataset reverse (Celeb-DF → FF++).
Train on Celeb-DF v2, test on FaceForensics++. This reverses the direction to determine whether the FF++ → Celeb-DF transfer is symmetric. We hypothesize that the smaller, less diverse Celeb-DF training set will generalize poorly to FF++'s varied manipulations.

Exp 4 — Held-out cross-dataset (FF++ + Celeb-DF → DFD).
Train on the combined FF++ and Celeb-DF v2 training sets, test on the DeepFakeDetection dataset. DFD is excluded entirely from all prior experiments and model selection. This represents the most realistic deployment scenario: training on all available labeled data, then deploying against an unseen manipulation source.

All experiments use the same five backbones, same training recipe, and same evaluation protocol. The only variable is the train/test dataset pair.

Results

Cross-dataset AUC (video-level)

Backbone	Exp 1 (FF++ → FF++)	Exp 2 (FF++ → Celeb-DF)	Exp 3 (Celeb-DF → FF++)	Exp 4 (FF++ + Celeb-DF → DFD)
EfficientNet-B4	0.901	0.788	~0.514	0.788
Xception	0.847	0.735	~0.521	0.761
ResNet-50	0.889	0.812	~0.537	0.834
ConvNeXt-Tiny	0.912	0.843	~0.529	0.915
ViT-Base	0.874	0.713	~0.511	0.798
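The per-experiment rankings can be recomputed directly from the table. The snippet below copies the AUC values as reported (the Exp 3 entries are approximate in the table) and sorts the backbones for each experiment; the variable and function names are hypothetical.

```python
# Video-level AUC per backbone, in the order Exp 1, Exp 2, Exp 3, Exp 4.
AUC = {
    "EfficientNet-B4": [0.901, 0.788, 0.514, 0.788],
    "Xception": [0.847, 0.735, 0.521, 0.761],
    "ResNet-50": [0.889, 0.812, 0.537, 0.834],
    "ConvNeXt-Tiny": [0.912, 0.843, 0.529, 0.915],
    "ViT-Base": [0.874, 0.713, 0.511, 0.798],
}


def ranking(exp_index):
    """Backbones sorted from highest to lowest AUC for one experiment."""
    return sorted(AUC, key=lambda name: AUC[name][exp_index], reverse=True)


for i in range(4):
    print(f"Exp {i + 1}:", ", ".join(ranking(i)))
```

Sorting each column separately makes it easy to check any ranking claim against the reported numbers.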

Finding 1 — ConvNeXt-Tiny is the strongest backbone. It achieves the highest AUC on Exp 1 (0.912), Exp 2 (0.843), and Exp 4 (0.915). ResNet-50 ranks second on Exp 2 (0.812) and Exp 4 (0.834) but third on Exp 1 (0.889), behind EfficientNet-B4 (0.901). The modern CNN outperforms both the established deepfake baselines (EfficientNet-B4, Xception) and the Vision Transformer.

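Finding 2 — ViT-Base underperforms. Despite 86M parameters (about 3× ConvNeXt-Tiny), ViT-Base achieves the lowest AUC on Exp 2 (0.713) and trails ConvNeXt-Tiny and ResNet-50 on Exp 4 (0.798), where only EfficientNet-B4 (0.788) and Xception (0.761) score lower. The convolutional inductive bias appears to provide an advantage in this data regime.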

Finding 3 — Exp 3 collapses to near chance. All five backbones achieve AUC between roughly 0.511 and 0.537 when trained on Celeb-DF and tested on FF++, barely above the 0.5 expected from random guessing. This collapse is uniform across architectures, which suggests the failure is caused by training-set homogeneity rather than model capacity.

Finding 4 — Noise is the most damaging corruption. Gaussian noise at σ = 40 reduces EfficientNet-B4's DFD AUC from 0.788 to 0.491, which is chance level. Blur and downscaling cause moderate drops. JPEG compression causes the smallest degradation, even at quality 10. Since the training augmentation does not include JPEG compression, this tolerance more plausibly comes from the already-compressed source videos and the JPEG-stored training crops than from augmentation.

Discussion

The Exp 3 collapse is the most important result in this paper — and the most inconvenient one.

Every backbone trained on Celeb-DF fails on FaceForensics++. Not slightly. Not recoverable with a better architecture. Uniformly, regardless of model size or design philosophy. ConvNeXt-Tiny, with its 0.915 AUC on Exp 4, achieves 0.529 on Exp 3. The gap is not about the model. It is about what the model was trained on.

FaceForensics++ contains five manipulation methods. Celeb-DF contains one pipeline. A model trained on FF++ learns a portfolio of forgery signatures. A model trained on Celeb-DF learns one — and is blind to everything else.

The ViT-Base result points the same way. It never ranks above third among the five backbones, and a likely explanation is that attention mechanisms need scale to learn spatial priors that convolutions get for free. In this regime, 86M parameters is not an advantage.

Practical takeaway: if you are building a deepfake detector, the most valuable hour you can spend is not on architecture search — it is on dataset curation.

Conclusion

We presented a controlled cross-dataset benchmark for deepfake detection comparing five backbone architectures across four experiments with strict identity-disjoint splits. Our main findings are:

ConvNeXt-Tiny achieves the best cross-dataset generalization (AUC 0.915 on the held-out DFD test set), outperforming both classic CNN baselines and the Vision Transformer.
ViT-Base trails ConvNeXt-Tiny in every experiment and ranks last on Exp 2, suggesting that the convolutional inductive bias remains advantageous for face forgery detection without large-scale face-specific pretraining.
Training-set diversity dominates cross-dataset performance. Every backbone collapses to near-chance when trained on the homogeneous Celeb-DF dataset and tested on FaceForensics++. No architecture compensates for insufficient training distribution coverage.
Gaussian noise is the primary robustness failure mode, dropping detection performance to chance level under heavy corruption (AUC 0.491 for EfficientNet-B4 at σ = 40). Future work should explicitly incorporate noise augmentation into training.

These findings argue against the current convention of reporting single-dataset accuracy as a proxy for deepfake detector quality. Meaningful evaluation requires cross-dataset transfer experiments with identity-disjoint splits. We release all training code, evaluation scripts, and results JSON files to support reproducible benchmarking in the community.