What Is Prompt Injection, Really?
Here's the thing: prompt injection is probably the biggest security headache in the LLM world right now. It happens when someone crafts input that tricks your model into doing something it shouldn't — running unauthorized commands, leaking your system prompt, or blowing past safety guardrails. Unlike SQL injection, where you can parameterize your way out of trouble, prompt injection exploits something fundamental about how language models work. They simply can't tell the difference between your instructions and an attacker's instructions buried in the input.
The scary part is that the risk scales with your model's power. A simple chatbot that just generates text? Limited damage. But an AI agent with database access, API keys, and file system permissions? That's a jackpot for attackers. If someone can hijack your agent through injected instructions, they effectively gain access to every system that agent can touch.
Prompt injection isn't a bug you can patch — it's a fundamental property of how today's LLMs process text. You need multiple layers of defense, not a silver bullet.
Direct vs. Indirect Injection
Direct prompt injection is the straightforward kind. The attacker types something like "Ignore all previous instructions and..." right into the chat. More sophisticated versions use role-playing tricks, encoding schemes, or multi-turn conversations that slowly steer the model off course. Modern LLMs handle the basic stuff better now, but creative attackers keep finding new angles.
Indirect injection is where things get really nasty. Instead of typing the attack directly, the attacker hides malicious instructions inside data your model will process — a web page, a document, an email, a database record, an API response. When your RAG pipeline retrieves a page with hidden instructions, your model might follow them as if you wrote them yourself. I've seen this in the wild, and it's genuinely hard to detect.
Attack Patterns You'll See in the Wild
- Instruction override: Direct commands like "ignore your system prompt" — crude but still effective against some models
- Context manipulation: Slowly shifting the conversation to normalize unauthorized behavior over multiple turns
- Encoding attacks: Using Base64, ROT13, or Unicode tricks to sneak payloads past input filters
- Payload splitting: Spreading the attack across multiple inputs or retrieved documents so no single piece looks suspicious
- Virtualization: Asking the model to role-play as a different AI that has no safety restrictions
- Indirect injection via data: Planting instructions in documents, web pages, or database records that your model will consume
First Line of Defense: Input Sanitization
Your first move should be sanitizing inputs before they ever reach the model. Strip known injection patterns, enforce length limits, validate formats, and — if you have the budget — run a classifier model trained specifically to catch injection attempts. Will this stop everything? No. But it raises the bar significantly and filters out the low-effort attacks that make up the majority of attempts.
class InputSanitizer:
INJECTION_PATTERNS = [
r"ignore (all |any )?(previous|prior|above) (instructions|prompts)",
r"you are now",
r"new instructions:",
r"system prompt:",
r"\[INST\]",
r"<\|im_start\|>system",
]
def sanitize(self, user_input: str) -> tuple[str, float]:
risk_score = 0.0
cleaned = user_input
for pattern in self.INJECTION_PATTERNS:
if re.search(pattern, cleaned, re.IGNORECASE):
risk_score += 0.3
cleaned = re.sub(pattern, "[FILTERED]", cleaned, flags=re.IGNORECASE)
# Length-based risk adjustment
if len(cleaned) > 5000:
risk_score += 0.1
return cleaned, min(risk_score, 1.0)A basic input sanitizer — pattern matching plus risk scoring to catch the obvious stuff
Second Layer: Output Validation
In practice, some injections will get through your input filters. That's just reality. So you need to inspect what comes out the other side too. Check outputs against expected formats, scan for leaked system prompts or sensitive data, make sure tool calls stay within allowed boundaries, and run a separate classifier to flag suspicious responses. Think of it as a safety net for when your first line of defense has a bad day.
Third Layer: Privilege Separation
This one is huge and often overlooked. Follow the principle of least privilege religiously. Every agent and tool should have the absolute minimum permissions needed to do its job. Critical operations should always require explicit user confirmation — no matter what the model says. Scope network access, file system access, and API credentials as tightly as you can. And run your agents in sandboxed environments so a compromised model can't take down the whole system.
Putting It All Together: Defense in Depth
No single layer will save you. The real power comes from combining everything — input sanitization, output validation, privilege separation, monitoring, and rate limiting. Each layer catches what slips through the others. You want your security posture to degrade gracefully, not collapse like a house of cards when one defense fails.
class DefenseOrchestrator:
def __init__(self):
self.input_sanitizer = InputSanitizer()
self.output_validator = OutputValidator()
self.permission_checker = PermissionChecker()
self.rate_limiter = RateLimiter(max_actions=10, window_seconds=60)
async def process_request(self, user_input: str, context: dict) -> Response:
# Layer 1: Input sanitization
cleaned_input, risk_score = self.input_sanitizer.sanitize(user_input)
if risk_score > 0.7:
return Response.blocked("Input flagged as potential injection")
# Layer 2: Rate limiting
if not self.rate_limiter.allow(context["user_id"]):
return Response.blocked("Rate limit exceeded")
# Layer 3: Generate response with constrained context
response = await self.llm.generate(cleaned_input, context)
# Layer 4: Output validation
validated = self.output_validator.validate(response, context)
if not validated.safe:
return Response.blocked("Output failed safety validation")
# Layer 5: Permission check for any tool calls
for tool_call in response.tool_calls:
if not self.permission_checker.allowed(tool_call, context):
return Response.blocked(f"Unauthorized action: {tool_call.name}")
return validated.responseThe full defense orchestrator — five layers working together to keep things locked down
Monitoring: Your Eyes and Ears
Even with all these defenses, you need eyes on the system. Log every input, output, and tool call for post-hoc analysis. Set up alerts for the weird stuff — unusually long inputs, unexpected tool calls, one user hammering your API, or outputs containing patterns that look like leaked secrets. And build circuit breakers into your system. When something goes wrong, you want to instantly revoke agent permissions and fall back to safe defaults. Don't wait for a human to notice.
Warning
Don't rely on the LLM itself to detect prompt injection. The same thing that makes models vulnerable — their eagerness to follow instructions in context — also makes them terrible judges of whether their own context has been compromised.
What's Next?
The cat-and-mouse game between attackers and defenders will keep evolving as LLMs get more capable and more deeply woven into critical systems. There's promising research happening — formal verification of LLM outputs, hardware-level isolation for AI workloads, and architectures that fundamentally separate instruction processing from data processing. But we're not there yet. For now, defense in depth is your best bet. Layer your defenses, assume each one will eventually fail, and design your system to handle that gracefully.