STAIR: Improving Safety Alignment with Introspective Reasoning

Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu

International Conference on Machine Learning 2025 · Oral

In an era where large language models (LLMs) are rapidly integrating into critical applications, from medical advice to policy drafting, concerns about their safety and trustworthiness have escalated. Despite their powerful capabilities, LLMs can, and often do, generate harmful or illegal content, especially when subjected to sophisticated "jailbreaking" attacks. The talk "STAIR: Improving Safety Alignment with Introspective Reasoning," presented by Yichi Zhang of Tsinghua University and co-authors, introduces a novel framework designed to address these pressing safety issues without compromising the models' utility.

AI review

STAIR proposes a three-stage training pipeline for LLM safety alignment that replaces reactive refusal behavior with deliberate chain-of-thought reasoning, using MCTS-guided self-improvement and a custom safety-helpfulness reward function. The empirical results are competitive and the framing is coherent, but the theoretical contributions are more modest than advertised: the reward design leans on informal desiderata rather than sharp formal results, the MCTS integration is an engineering combination rather than a principled new algorithm, and the central "System 2 safety" intuition, while…
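
To make the mechanics concrete, here is a minimal, self-contained sketch of what step-level, MCTS-guided search under a combined safety-helpfulness reward could look like. Everything in it is an illustrative assumption rather than the paper's actual implementation: `propose_step` and `combined_reward` are toy stand-ins for LLM sampling and learned reward models, and the linear `lam` trade-off and UCT hyperparameters are placeholders.

```python
import math
import random

# --- Hypothetical stand-ins: a real pipeline would use LLM sampling and
# --- learned safety/helpfulness reward models. All names here are made up.

def propose_step(trace):
    """Sample one candidate chain-of-thought step continuing `trace`."""
    return f"step-{random.randint(0, 99)}"

def combined_reward(trace, lam=0.5):
    """Linear safety/helpfulness mix (an assumption; the actual STAIR
    reward design is only loosely characterized in the review)."""
    safety = random.random()        # placeholder safety judge
    helpfulness = random.random()   # placeholder helpfulness judge
    return lam * safety + (1.0 - lam) * helpfulness

# --- Minimal UCT-style Monte-Carlo tree search over reasoning steps ---

class Node:
    def __init__(self, trace, parent=None):
        self.trace = trace          # reasoning steps produced so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # running sum of rollout rewards

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")     # always try unvisited children first
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(prompt, iterations=200, max_depth=4, branching=3):
    root = Node([prompt])
    for _ in range(iterations):
        # 1. Selection: follow UCT through fully expanded nodes.
        node = root
        while len(node.children) == branching and len(node.trace) < max_depth:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: append one new candidate step if depth allows.
        if len(node.trace) < max_depth:
            node = Node(node.trace + [propose_step(node.trace)], parent=node)
            node.parent.children.append(node)
        # 3. Evaluation: score the (possibly partial) reasoning trace.
        reward = combined_reward(node.trace)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the principal variation: most-visited child at each level.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.trace

if __name__ == "__main__":
    print(mcts("Explain how to respond safely to a borderline request."))
```

In an actual self-improvement loop, the highest-reward traces recovered by such a search would presumably be fed back as training data for the next round of the policy; the sketch only shows the search and reward-combination step.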