Why Deepfake Detection Accuracy Numbers Are Misleading (And What a Fair Benchmark Looks Like)