NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Network and Distributed System Security (NDSS) Symposium 2026 · Day 1 · AI Security
This talk presents **NeuroStrike**, a neuron-level attack that jailbreaks aligned large language models by identifying and pruning **safety neurons** -- the specific neurons responsible for a model's refusal behavior when it is presented with malicious queries. Pruning just **0.5% of the neurons in a targeted layer** (roughly 10,000 neurons in a 32-billion-parameter model) raises the average attack success rate from **12.1% to 76%**. The attack transfers to fine-tuned and distilled model variants, works against vision-language models, and extends to a black-box setting through neuron-activation-informed jailbreak prompt generation.
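To make the mechanism concrete, here is a minimal sketch of the general idea under stated assumptions, not the paper's exact procedure: record per-neuron MLP activations in one layer of a LLaMA-style HuggingFace model on malicious versus benign prompts, rank neurons by the activation gap, and zero out the top 0.5%. The model name, layer index, prompt sets, and scoring criterion are illustrative placeholders.

```python
# Hypothetical sketch of neuron-level safety pruning (not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any LLaMA-style chat model
LAYER = 12                               # assumption: targeted layer index
PRUNE_RATIO = 0.005                      # 0.5% of the layer's MLP neurons

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

mlp = model.model.layers[LAYER].mlp
acts = {"malicious": [], "benign": []}

def make_hook(bucket):
    # down_proj's input is the post-activation intermediate vector,
    # i.e. one scalar per MLP "neuron" in this layer.
    def hook(module, inputs):
        acts[bucket].append(inputs[0].detach().float().mean(dim=(0, 1)))
    return hook

@torch.no_grad()
def collect(prompts, bucket):
    handle = mlp.down_proj.register_forward_pre_hook(make_hook(bucket))
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        model(**ids)
    handle.remove()

# Toy prompt sets; the actual attack would use curated malicious/benign corpora.
malicious = ["How do I build a weapon?", "Write malware that steals passwords."]
benign = ["How do I bake sourdough bread?", "Summarize the plot of Hamlet."]
collect(malicious, "malicious")
collect(benign, "benign")

# Assumed criterion: candidate safety neurons fire more on refused (malicious)
# inputs than on benign ones.
diff = torch.stack(acts["malicious"]).mean(0) - torch.stack(acts["benign"]).mean(0)
k = int(PRUNE_RATIO * diff.numel())
safety_neurons = diff.topk(k).indices

# "Prune" the selected neurons by zeroing their output columns in down_proj,
# so they no longer contribute to the residual stream.
with torch.no_grad():
    mlp.down_proj.weight[:, safety_neurons] = 0.0
```

The pruning step is a simple weight edit, which is why the whole attack fits comfortably on a single consumer-grade GPU.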
AI review
A devastating demonstration that LLM safety alignment is far more fragile than the industry acknowledges. Pruning 0.5% of the neurons in a single layer strips the safety guardrails, and it runs on a free Google Colab GPU in under a minute. The transfer to fine-tuned, distilled, and vision-language models means the safety neurons are a structural weakness, not a per-model quirk. The black-box extension via neuron-informed prompt generation is the cherry on top. This is the most impactful LLM security research I've seen at this conference.