SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su, Man Luo, Kris Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard
International Conference on Machine Learning 2025 · Oral
**Multimodal Large Language Models (LLMs)** have demonstrated remarkable capabilities in understanding and generating content across modalities, yet they continue to struggle with knowledge-intensive tasks that require precise external information beyond their parametric memory. This talk introduces **SK-VQA**, a large-scale synthetic dataset designed to address this limitation by enabling the training of context-augmented multimodal LLMs. The dataset supports **Retrieval-Augmented Generation (RAG)** for multimodal models, a step toward AI systems that are more accurate and less prone to hallucination.
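To make "context augmentation" concrete, the sketch below shows one minimal way an image-question pair could be paired with retrieved text before being handed to a multimodal LLM. The passage corpus, the bag-of-words retrieval scheme, and the prompt layout are all illustrative assumptions for this page, not the retrieval pipeline used in the paper.

```python
# Minimal sketch of context-augmented VQA inference (illustrative only,
# not the SK-VQA pipeline): retrieve a relevant passage for an
# image-question pair, then prepend it to the multimodal LLM prompt.

from collections import Counter
import math

# Hypothetical knowledge passages; in practice these would come from a
# large retrieval corpus (e.g., Wikipedia articles linked to the image).
PASSAGES = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "The Statue of Liberty was a gift from France, dedicated in 1886.",
    "Mount Fuji is the highest mountain in Japan at 3,776 meters.",
]

def _bow(text: str) -> Counter:
    """Lowercased bag-of-words representation of a text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the top-k passages most similar to the question."""
    q = _bow(question)
    ranked = sorted(passages, key=lambda p: cosine(q, _bow(p)), reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str, image_path: str) -> str:
    """Assemble the context-augmented prompt handed to a multimodal LLM."""
    context = "\n".join(retrieve(question, PASSAGES))
    return (
        f"[image: {image_path}]\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer using only the context above."
    )

if __name__ == "__main__":
    print(build_rag_prompt("When was the tower in this photo completed?",
                           "eiffel_tower.jpg"))
```

In a real system the toy retriever would be replaced by a dense or multimodal retriever and the assembled prompt passed, together with the image, to the model; the point here is only how retrieved context augments the question.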
AI review
SK-VQA is a competent dataset-construction effort with real engineering value: it is large, carefully filtered, and yields a benchmark that is genuinely harder for current multimodal LLMs than prior KB-VQA datasets. But the talk presents the work as a theoretical and empirical breakthrough when it is, at its core, a data paper. The central claims, that synthetic GPT-4 data rivals real data and that SK-VQA training improves out-of-domain generalization, rest on fine-tuning experiments with no causal mechanism proposed or tested. The 'surprising' synthetic-vs-real finding is presented…