Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image

Nan Jiang, Jun Han

33rd USENIX Security Symposium · Day 1 · USENIX Security '24

This talk introduces "Voice," a generative model that demonstrates a pervasive new attack vector against voice authentication systems. Traditionally, deepfake attacks on voice authentication require high-quality voice recordings of the victim, which are often difficult to obtain due to issues like background noise or limited availability. Voice fundamentally shifts this paradigm by synthesizing voice recordings from a single, readily available face image of the target. This approach significantly lowers the bar for attackers, enabling them to bypass widely used voice authentication platforms and activate voice assistants with unprecedented ease.

AI review

This research introduces "Voice," a genuinely groundbreaking attack model that synthesizes convincing voice deepfakes from a *single face image*, completely bypassing the need for voice recordings. It demonstrates pervasive vulnerability across major commercial authentication systems and voice assistants, achieving a 50% success rate. This fundamentally shifts the threat model for voice authentication and demands immediate industry attention.
