Safety Misalignment Against Large Language Models

Yichen Gong

Network and Distributed System Security (NDSS) Symposium 2025 · Day 1 · LLM Security

The proliferation of Large Language Models (LLMs) has brought new capabilities, from sophisticated conversation and writing to complex coding tasks. This rapid advancement is accompanied by significant safety challenges, including the potential for LLMs to generate misinformation, perpetuate harmful stereotypes, or provide dangerous instructions. To mitigate these risks, responsible developers undertake **safety alignment**, a process that uses techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to instill human-aligned values in models. This talk, "Safety Misalignment Against Large Language Models," presented by Delong Ran from Tsinghua University, examines a critical and underexplored vulnerability of these carefully aligned models: **safety misalignment**.

AI review

Competent, methodical research into LLM safety misalignment that covers useful ground, particularly the SSR attack and SSR-D defense contributions. It operates in a space that is now crowded enough that the novelty bar is high, and this work doesn't clear it cleanly. Solid NDSS-tier academic work that practitioners should be aware of, but it won't redefine how anyone thinks about the problem.
