Bypassing AI Security Controls with Prompt Formatting

Nathan Kirk

fwd:cloudsec North America 2025 · Day 2 · Track 2 - Crestone

Nathan Kirk, Director at NR Labs and co-author of a blog post with AWS, presented research on **prompt formatting** -- a technique for bypassing AI guardrails by instructing the model to format its responses in non-standard ways that evade inline content filters. Unlike prompt injection, which targets the model's instruction processing, prompt formatting manipulates the model's output structure so that sensitive information passes through guardrails unrecognized. Kirk demonstrated the technique live against **Amazon Bedrock Guardrails' Sensitive Information Filters**, successfully extracting names from a protected knowledge base using Claude 3.7 Sonnet by requesting only the first four characters of each name appended with numbers. The technique is model-agnostic, flexible, and resistant to the standard defenses designed for prompt injection.

AI review

A clean, practical demonstration that AI guardrails for PII filtering are trivially bypassable by requesting non-standard output formats. The Python slice notation for progressive extraction is a nice touch. The live demo failure of the system prompt defense was the most honest and informative moment of the talk. Not deeply novel -- it's fundamentally 'WAF bypass for AI' -- but it's well-executed and immediately useful for anyone pentesting AI applications.

Watch on YouTube