There is No War in Ba Sing Se: A Global Analysis of Content Moderation in Large Language Models

Friedemann Lipphardt

Network and Distributed System Security (NDSS) Symposium 2026 · Day 3 · Privacy & Measurement

This research presents the first large-scale global analysis of **LLM content moderation** across geographic locations, languages, and topics. By issuing over **1,000 potentially unsafe prompts** to **15 LLMs** from **12 geographic locations** in **13 languages**, the researchers collected over **700,000 responses** and classified each as hard-moderated (outright refusal), soft-moderated (evasive or restricted response), or unmoderated. A custom-trained **DeBERTa classifier** proved far more effective at detecting soft moderation than off-the-shelf LLM judges, finding soft-moderation rates above **50%** where ChatGPT/Gemini detected only 10-20%.
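The three-way labeling pipeline described above can be sketched as follows. This is a minimal illustrative sketch only: the refusal markers, the evasive cues, and the threshold are hypothetical stand-ins, and the placeholder scoring function merely mimics where the paper's fine-tuned DeBERTa classifier would sit.

```python
# Hedged sketch of a three-way moderation classifier: hard-moderated,
# soft-moderated, or unmoderated. All phrase lists here are illustrative
# placeholders, not the paper's actual prompt set or trained model.

REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot assist",
    "i'm sorry, but",
)

def soft_moderation_score(response: str) -> float:
    """Placeholder for the paper's fine-tuned DeBERTa classifier: returns a
    probability-like score that the response is evasive or restricted."""
    evasive_cues = ("complex topic", "many perspectives", "consult official sources")
    hits = sum(cue in response.lower() for cue in evasive_cues)
    return min(1.0, hits / 2)

def classify_response(response: str, soft_threshold: float = 0.5) -> str:
    """Route a model response into one of the study's three categories."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "hard-moderated"   # outright refusal
    if soft_moderation_score(response) >= soft_threshold:
        return "soft-moderated"   # evasive / restricted answer
    return "unmoderated"

print(classify_response("I'm sorry, but I can't help with that request."))
# → hard-moderated
print(classify_response("This is a complex topic with many perspectives."))
# → soft-moderated
print(classify_response("Here is a direct answer to your question."))
# → unmoderated
```

The interesting design point the paper highlights is the middle branch: hard refusals are easy to spot with surface patterns, while soft moderation requires a trained classifier because evasive answers still look like genuine responses.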

AI review

A massive measurement study (700K+ responses, 15 LLMs, 12 countries, 13 languages) that reveals geographic and linguistic biases in LLM content moderation. The key technical contribution is the custom DeBERTa soft-moderation classifier, which finds 3-5x more evasive responses than LLM judges. The most striking finding is the politically aligned soft moderation from Chinese models, such as denying Uyghur persecution and whitewashing events in Hong Kong.
