ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

Ahmad ALBarqawi

Network and Distributed System Security (NDSS) Symposium 2026 · Day 2 · Multimedia Forensics

This talk presents **ViGText**, a deepfake image detection system that combines **vision-language model (VLM) explanations** with **graph neural networks (GNNs)**. It targets state-of-the-art generalization to fine-tuned model variants and robustness against adversarial attacks built on foundation models. Where previous approaches simply concatenate image and text features, ViGText builds integrated graphs that cross-reference forensic explanations with specific image regions, enabling context-aware relational inference.
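To make the idea of cross-referencing explanations with image regions concrete, here is a minimal sketch of one way such a graph could be laid out. It is not the authors' construction: the function `build_graph`, the node naming scheme, the grid-adjacency edges, and the assumption that each explanation arrives already tagged with the patches it refers to are all hypothetical choices made for illustration.

```python
def build_graph(grid, explanations):
    """Build a toy image-text graph (illustrative assumption, not ViGText itself).

    grid:         number of patches per side (the image is split into grid x grid patches)
    explanations: list of (text, [(row, col), ...]) pairs, where each pair links one
                  explanation sentence to the patches it is assumed to describe
    Returns (nodes, edges): a dict of node attributes and a set of undirected edges.
    """
    nodes = {}
    edges = set()

    # One node per image patch, connected to its right and lower neighbours.
    for r in range(grid):
        for c in range(grid):
            name = f"patch_{r}_{c}"
            nodes[name] = {"kind": "patch", "row": r, "col": c}
            if r + 1 < grid:
                edges.add((name, f"patch_{r + 1}_{c}"))
            if c + 1 < grid:
                edges.add((name, f"patch_{r}_{c + 1}"))

    # One node per explanation, connected to every patch it references.
    for i, (text, regions) in enumerate(explanations):
        name = f"expl_{i}"
        nodes[name] = {"kind": "explanation", "text": text}
        for r, c in regions:
            if not (0 <= r < grid and 0 <= c < grid):
                raise ValueError(f"patch ({r}, {c}) is outside the {grid}x{grid} grid")
            edges.add((f"patch_{r}_{c}", name))

    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_graph(
        4,
        [
            ("hairline blends into the background", [(0, 1), (0, 2)]),
            ("shadow direction is inconsistent", [(3, 3)]),
        ],
    )
    print(len(nodes), len(edges))
```

With a 4x4 grid and the two example explanations, the graph has 18 nodes (16 patches plus 2 explanations) and 27 edges (24 between neighbouring patches plus 3 explanation-to-patch links). In a real pipeline each node would also carry a feature vector, and a GNN would pass messages along these edges. By contrast, plain concatenation of image and text features discards which explanation refers to which region.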

AI review

A well-engineered deepfake detection system that achieves strong generalization through graph-based cross-referencing of VLM explanations with image features. The insight that context-aware integration matters more than explanation quality is sound, and the generalization results on unseen fine-tuned variants are impressive. However, this is defensive detection research with no offensive component, the system is expensive to deploy at scale, and the fixed 4x4 patching looks like a design limitation that adversaries could exploit.
