Spectrogram Autoencoder for Zero-Shot Deepfake Detection
An autoencoder trained only on real speech flags deepfakes by reconstruction error: synthetic audio reconstructs poorly, so a high error marks a clip as likely fake. The work compares CNN and ViT-Tiny backbones and two spectrogram masks. The best configuration reaches 15.5% EER on ASVspoof 2019 and 13.1% EER on In-the-Wild, fully zero-shot.
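The scoring and evaluation idea can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes mean squared error as the reconstruction error (the summary does not name the metric) and computes EER with a plain threshold sweep. The function names `reconstruction_score` and `equal_error_rate` are hypothetical.

```python
import numpy as np


def reconstruction_score(spec, recon):
    """Mean squared error between a spectrogram and its reconstruction.

    Assumption: MSE stands in for whatever error the original work uses.
    A higher score means the autoencoder reconstructed the input worse.
    """
    spec = np.asarray(spec, dtype=float)
    recon = np.asarray(recon, dtype=float)
    return float(np.mean((spec - recon) ** 2))


def equal_error_rate(scores, labels):
    """Approximate EER for scores where higher means 'more likely fake'.

    labels: 1 = fake, 0 = real. Sweeps every observed score as a threshold
    and returns the average of the two error rates where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        flagged = scores >= t
        fpr = np.mean(flagged[labels == 0])   # real clips flagged as fake
        fnr = np.mean(~flagged[labels == 1])  # fake clips that were missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return float(eer)
```

In use, each utterance's spectrogram would be passed through the trained autoencoder, scored with `reconstruction_score`, and the resulting scores evaluated against ground-truth labels with `equal_error_rate`. No fake audio is needed at training time, which is what makes the setup zero-shot.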