Why You Can't Ship Agents Without Guardrails
Autonomous AI agents are a whole different ballgame compared to traditional software. Instead of following predictable code paths, they make decisions, call tools, and produce outputs you can't always anticipate. That unpredictability is exactly what makes them useful — but it's also what makes them dangerous when left unchecked. Imagine an agent with access to your database, email API, and payment gateway. If it misreads an instruction or gets hit with a prompt injection attack, it can do real damage in seconds.
Guardrails are the safety nets that keep agent behavior within acceptable bounds. They work across multiple layers — input, output, action, and system level — so that even when the model makes mistakes, the system as a whole fails gracefully. Think of it like building a bridge: you don't just trust the steel to hold. You add redundant supports, load sensors, and emergency shutoffs. AI agents need that same defense-in-depth mindset.
Here's the thing: if you deploy agents without guardrails, it's not a question of if something goes wrong — it's when. I've seen production incidents ranging from data leaks via prompt injection to runaway API calls racking up thousands of dollars in minutes. The cost of adding guardrails is tiny compared to cleaning up after these failures.
Guardrails aren't a limitation on what your agent can do — they're a prerequisite. An agent you can't trust to operate safely will never get the autonomy it needs to be truly useful.
Input Validation and Prompt Shields
Your first line of defense is input validation. Every message, document, or data source entering the agent's context needs to be screened for adversarial content, malformed instructions, and policy violations. Prompt injection — where an attacker hides instructions inside seemingly harmless content — is still one of the biggest threats out there. If you're running a RAG agent that pulls in external documents, you're especially vulnerable since any document in the corpus could carry injected directives.
Prompt shields work by classifying incoming content before it hits the agent's reasoning loop. Modern approaches combine rule-based pattern matching with lightweight classifier models trained to spot injection attempts. The rule-based layer catches known patterns — phrases like "ignore previous instructions," base64 or unicode encoding tricks, and delimiter manipulation. The classifier layer handles novel attacks that slip past static rules by evaluating the semantic intent behind the input.
You'll also want to enforce structural constraints on inputs. If the agent expects a user query, check for length limits, valid character encoding, and the absence of control characters. For tool results, run schema validation to make sure external data matches expected types and ranges before the agent processes it. These basic checks sound boring, but they prevent a surprising number of production failures.
import re
from dataclasses import dataclass
from enum import Enum


class ThreatLevel(Enum):
    SAFE = "safe"
    SUSPICIOUS = "suspicious"
    BLOCKED = "blocked"


@dataclass
class ValidationResult:
    threat_level: ThreatLevel
    reasons: list[str]
    sanitized_input: str | None


class PromptShield:
    """Multi-layered input validation for agent systems."""

    # Known injection patterns
    INJECTION_PATTERNS = [
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"you\s+are\s+now\s+(a|an)\s+",
        r"system\s*:\s*",
        r"\[INST\]|\[/INST\]|<<SYS>>",
        r"act\s+as\s+(if|though)?\s+",
    ]

    # Maximum allowed input length (characters)
    MAX_INPUT_LENGTH = 10_000

    def __init__(self, classifier=None):
        self._patterns = [re.compile(p, re.IGNORECASE) for p in self.INJECTION_PATTERNS]
        self._classifier = classifier  # Optional ML-based classifier

    def validate(self, user_input: str) -> ValidationResult:
        """Validate user input through multiple defense layers."""
        reasons: list[str] = []

        # Layer 1: Length and encoding checks
        if len(user_input) > self.MAX_INPUT_LENGTH:
            return ValidationResult(
                threat_level=ThreatLevel.BLOCKED,
                reasons=["Input exceeds maximum allowed length"],
                sanitized_input=None,
            )

        # Layer 2: Pattern-based injection detection
        for pattern in self._patterns:
            if pattern.search(user_input):
                reasons.append(f"Matched injection pattern: {pattern.pattern}")

        # Layer 3: Unicode and encoding attack detection
        decoded = self._check_encoding_attacks(user_input)
        if decoded != user_input:
            reasons.append("Detected encoded content (potential obfuscation)")

        # Layer 4: ML classifier (if available)
        if self._classifier and self._classifier.is_injection(user_input):
            reasons.append("ML classifier flagged as potential injection")

        # Determine threat level
        if len(reasons) >= 2:
            return ValidationResult(ThreatLevel.BLOCKED, reasons, None)
        elif len(reasons) == 1:
            return ValidationResult(ThreatLevel.SUSPICIOUS, reasons, decoded)
        return ValidationResult(ThreatLevel.SAFE, [], user_input)

    def _check_encoding_attacks(self, text: str) -> str:
        """Detect and decode common encoding-based attacks."""
        import base64

        # Check for base64-encoded segments
        b64_pattern = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
        for match in b64_pattern.finditer(text):
            try:
                decoded = base64.b64decode(match.group()).decode("utf-8")
                # If decoded text contains injection patterns, flag it
                for pattern in self._patterns:
                    if pattern.search(decoded):
                        return decoded
            except Exception:
                pass
        return text

A multi-layered prompt shield that combines pattern matching, encoding detection, and ML classification
Output Filtering and Content Moderation
Even with rock-solid input validation, you still need to scrutinize what the agent produces. Language models can generate content that violates policies, leaks sensitive data, or includes hallucinated facts that mislead users. Output filtering is your last checkpoint before any response reaches the user or triggers a downstream action.
Content moderation classifiers evaluate agent outputs across several dimensions: toxicity, personally identifiable information (PII), confidential data patterns, and domain-specific policy violations. For PII detection, combining named entity recognition with regex patterns for structured data (credit card numbers, SSNs, email addresses) gives you solid coverage. The big architectural question is whether to block the entire response or just redact the offending parts. Redaction keeps the useful bits but risks incomplete removal. Blocking is safer but hurts the user experience.
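A minimal sketch of the redaction approach, using regexes for structured PII only. The patterns below are illustrative and deliberately loose; a production system would pair them with named entity recognition and tighter validation (e.g. Luhn checks for card numbers).

```python
import re

# Hypothetical regexes for structured PII; real systems pair these with NER.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace structured PII with typed placeholders; return text and the PII types found."""
    found: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text, found


redacted, kinds = redact_pii("Contact jane@example.com, SSN 123-45-6789.")
print(redacted)  # Contact [REDACTED_EMAIL], SSN [REDACTED_SSN].
```

Returning the list of PII types alongside the redacted text lets a policy layer decide per-type whether redaction is acceptable or the whole response should be blocked.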
Hallucination detection is trickier. Factual grounding — checking that the agent's claims are actually backed by source documents — means comparing output statements against retrieved context. Techniques range from simple string overlap metrics to entailment classifiers that judge whether the source text logically supports each claim. In high-stakes domains like healthcare or legal advice, the standard practice is to require explicit source citations for every factual claim and flag anything uncited for human review.
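The string-overlap end of that spectrum can be sketched in a few lines. This is a crude illustration, not a substitute for entailment models; the stopword list and the 0.7 threshold are arbitrary assumptions.

```python
def overlap_score(claim: str, source: str) -> float:
    """Fraction of the claim's content words that also appear in the source text."""
    stopwords = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and"}
    claim_words = {w for w in claim.lower().split() if w not in stopwords}
    source_words = set(source.lower().split())
    if not claim_words:
        return 1.0  # An empty claim asserts nothing
    return len(claim_words & source_words) / len(claim_words)


def is_grounded(claim: str, source: str, threshold: float = 0.7) -> bool:
    """Flag claims whose content words are mostly absent from the retrieved context."""
    return overlap_score(claim, source) >= threshold


source = "the invoice was paid on 2024-03-01 by wire transfer"
print(is_grounded("invoice paid by wire transfer", source))      # True
print(is_grounded("invoice disputed by the customer", source))   # False
```

Lexical overlap misses paraphrases and negations (it cannot tell "paid" from "not paid" reordered), which is why high-stakes domains move to entailment classifiers and mandatory citations.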
Don't forget format constraints either. If your agent should return JSON, validate the output against the schema before passing it downstream. If it generates code, run static analysis to catch obvious security issues like SQL injection or command injection patterns. These structural checks complement the semantic checks from your content classifiers.
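A hand-rolled sketch of the JSON contract check, using only the standard library; the order-status contract below is a hypothetical example, and a real system might use a full JSON Schema validator instead.

```python
import json


def validate_json_output(raw: str, required: dict[str, type]) -> tuple[bool, str]:
    """Parse agent output as JSON and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"not valid JSON: {e.msg}"
    if not isinstance(data, dict):
        return False, "top-level value must be an object"
    for field, type_ in required.items():
        if field not in data:
            return False, f"missing required field: {field}"
        if not isinstance(data[field], type_):
            return False, f"field {field!r} must be {type_.__name__}"
    return True, "ok"


# Hypothetical contract for an order-status response
contract = {"order_id": str, "status": str, "items": list}

print(validate_json_output('{"order_id": "A1", "status": "shipped", "items": []}', contract))
print(validate_json_output('{"order_id": "A1"}', contract))
```

Running this check before any downstream consumer sees the payload turns a malformed model response into a recoverable error instead of a cascading failure.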
Warning
Don't rely on the system prompt alone to prevent harmful outputs. A sufficiently creative adversarial input can override system prompt instructions. Programmatic output filters that run outside the model's context are the only reliable defense against policy violations.
Action Boundaries and Permission Systems
When your agent can do things with real-world consequences — sending emails, modifying databases, processing payments — the permission system becomes your most critical guardrail. The principle of least privilege matters even more for AI agents than for human users, because agents act at machine speed without the natural hesitation or judgment a human would apply.
A good permission system defines three dimensions for each action: scope, rate, and approval level. Scope controls what the action can touch — a database tool might be limited to read-only access on specific tables. Rate limits cap the number or total value of actions within a time window — maybe your email tool can only send ten messages per hour. Approval level determines whether the action runs autonomously, goes through async review, or needs real-time human sign-off.
What I've seen work well in production is a tiered model. Low-risk actions like reading data or doing calculations run without human oversight. Medium-risk actions like sending notifications or updating non-critical records go through with logging and post-hoc review. High-risk actions — financial transactions, data deletion, emails to new recipients — require explicit human approval before they execute. And here's the important part: a security review should determine which tier each action falls into, not the agent itself.
from enum import Enum
from dataclasses import dataclass
from typing import Any, Callable, Awaitable
import time


class ApprovalLevel(Enum):
    AUTO = "auto"        # Execute without human oversight
    LOG = "log"          # Execute and log for review
    CONFIRM = "confirm"  # Require human approval before execution


@dataclass
class ActionPolicy:
    approval_level: ApprovalLevel
    max_calls_per_hour: int
    max_value_per_call: float | None = None  # For financial actions
    allowed_scopes: list[str] | None = None  # Restrict target resources


class ActionGuardrail:
    """Enforce permission boundaries on agent actions."""

    def __init__(self, policies: dict[str, ActionPolicy]):
        self._policies = policies
        self._call_log: dict[str, list[float]] = {}  # action -> timestamps

    async def execute(
        self,
        action_name: str,
        parameters: dict[str, Any],
        executor: Callable[..., Awaitable[Any]],
        approval_callback: Callable[[str, dict], Awaitable[bool]] | None = None,
    ) -> dict[str, Any]:
        """Execute an action with full guardrail enforcement."""
        policy = self._policies.get(action_name)
        if policy is None:
            # Deny unknown actions by default
            return {"status": "denied", "reason": "No policy defined for this action"}

        # Check scope restrictions
        if policy.allowed_scopes is not None:
            target = parameters.get("target", "")
            if not any(target.startswith(scope) for scope in policy.allowed_scopes):
                return {"status": "denied", "reason": f"Target '{target}' outside allowed scopes"}

        # Check rate limits
        now = time.time()
        recent_calls = [
            t for t in self._call_log.get(action_name, [])
            if now - t < 3600  # Within the last hour
        ]
        if len(recent_calls) >= policy.max_calls_per_hour:
            return {"status": "denied", "reason": "Rate limit exceeded"}

        # Check value limits for financial actions
        if policy.max_value_per_call is not None:
            value = parameters.get("amount", 0)
            if value > policy.max_value_per_call:
                return {"status": "denied", "reason": f"Amount {value} exceeds limit {policy.max_value_per_call}"}

        # Check approval level
        if policy.approval_level == ApprovalLevel.CONFIRM:
            if approval_callback is None:
                return {"status": "denied", "reason": "Human approval required but no callback provided"}
            approved = await approval_callback(action_name, parameters)
            if not approved:
                return {"status": "denied", "reason": "Human reviewer rejected the action"}

        # Execute the action
        try:
            result = await executor(**parameters)
            # Record the call
            self._call_log.setdefault(action_name, []).append(now)
            return {"status": "success", "result": result}
        except Exception as e:
            return {"status": "error", "reason": str(e)}


# Define policies for each available action
policies = {
    "read_database": ActionPolicy(
        approval_level=ApprovalLevel.AUTO,
        max_calls_per_hour=100,
        allowed_scopes=["users.profile", "products", "orders.status"],
    ),
    "send_email": ActionPolicy(
        approval_level=ApprovalLevel.LOG,
        max_calls_per_hour=10,
    ),
    "process_payment": ActionPolicy(
        approval_level=ApprovalLevel.CONFIRM,
        max_calls_per_hour=5,
        max_value_per_call=500.00,
    ),
    "delete_record": ActionPolicy(
        approval_level=ApprovalLevel.CONFIRM,
        max_calls_per_hour=3,
        allowed_scopes=["drafts", "temp_files"],
    ),
}

An action permission system with scope restrictions, rate limiting, and tiered approval levels
Monitoring and Circuit Breakers
Runtime monitoring is what turns your guardrails from static rules into an adaptive safety system. The circuit breaker pattern — borrowed from distributed systems — automatically shuts down agent operations when it spots anomalous behavior. The idea is simple: track key metrics in real time, and if anything crosses a threshold, trip the breaker and stop the agent before damage piles up.
The metrics you want to watch include token consumption rate, tool call frequency, error rate, loop detection (the same action repeating over and over), and wall-clock execution time. A healthy agent completing a normal task shows predictable patterns across all of these. When you see deviations — a sudden spike in tool calls, or a string of identical failed actions — that's your signal the agent is stuck in a loop, confused by unexpected input, or being manipulated by injected instructions.
In practice, you'll want circuit breakers at multiple levels. A per-request breaker caps the total resources a single agent call can consume. A per-user breaker prevents one user — or a compromised account — from hogging system resources. A system-wide breaker kicks in when aggregate error rates across all users exceed acceptable thresholds, signaling something systemic like model degradation or an upstream API outage.
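The per-request and per-user breakers described above share one mechanism, sketched here as a sliding-window failure counter. The thresholds, window, and cooldown values are illustrative assumptions; tune them against your own traffic.

```python
import time


class CircuitBreaker:
    """Trip when failures within a sliding window exceed a threshold."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 60.0,
                 cooldown_seconds: float = 300.0):
        self._max_failures = max_failures
        self._window = window_seconds
        self._cooldown = cooldown_seconds
        self._failures: list[float] = []  # Timestamps of recent failures
        self._tripped_at: float | None = None

    def allow(self) -> bool:
        """Return False while the breaker is open (agent halted)."""
        if self._tripped_at is not None:
            if time.time() - self._tripped_at < self._cooldown:
                return False
            # Cooldown elapsed: close the breaker and start fresh (half-open)
            self._tripped_at = None
            self._failures.clear()
        return True

    def record_failure(self) -> None:
        """Log a failed or anomalous step; trip the breaker if the window fills up."""
        now = time.time()
        self._failures = [t for t in self._failures if now - t < self._window]
        self._failures.append(now)
        if len(self._failures) >= self._max_failures:
            self._tripped_at = now  # Open the breaker


breaker = CircuitBreaker(max_failures=3, window_seconds=10.0)
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())  # False: three failures inside the window tripped the breaker
```

Call `allow()` before each agent step and `record_failure()` on errors or anomaly signals; the same class instantiated per request, per user, and globally gives you the three levels described above.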
Your observability stack should capture complete agent trajectories — every prompt, every tool call with its arguments and results, every output — in structured logs that support post-incident analysis. When something goes wrong, being able to replay the exact sequence of events that led to the failure is invaluable for finding root causes and hardening your guardrails. Use privacy-preserving techniques like hashing PII fields before storage so you get detailed observability without creating a data liability.
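One way to sketch the hashed-PII trajectory log: each record is a JSON line, with designated fields replaced by a salted hash so traces stay joinable but unreadable. The static salt here is a simplification; a real deployment would use a keyed HMAC with a managed secret.

```python
import hashlib
import json
import time


def hash_field(value: str, salt: str = "trace-salt") -> str:
    """One-way hash for PII fields (simplified; use a keyed HMAC in production)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]


def log_step(step_type: str, payload: dict, pii_fields: set[str]) -> str:
    """Emit one structured trajectory record with the named PII fields hashed."""
    record = {
        "ts": time.time(),
        "step": step_type,
        "payload": {
            k: hash_field(str(v)) if k in pii_fields else v
            for k, v in payload.items()
        },
    }
    return json.dumps(record)


line = log_step(
    "tool_call",
    {"tool": "send_email", "recipient": "jane@example.com", "subject": "Invoice"},
    pii_fields={"recipient"},
)
print(line)  # The recipient appears only as a salted hash
```

Because the same input always hashes to the same value, you can still correlate every action taken against a given recipient during an incident investigation without ever storing the address itself.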
Frameworks: NeMo Guardrails, Guardrails AI, and Beyond
Several open-source frameworks have popped up to make guardrail implementation easier, each with its own philosophy. NVIDIA's NeMo Guardrails takes a dialogue-management approach, using a custom scripting language called Colang to define conversational rails — basically which conversation flows are allowed and which are off-limits. It plugs into LangChain and supports input/output rails, topical constraints, and fact-checking against knowledge bases. The Colang abstraction makes it approachable for non-engineers, though it can feel limiting when you need complex programmatic guardrails.
Guardrails AI (the guardrails-ai library) takes a different angle: structured output validation. It uses declarative "Guard" specifications — historically written in an XML format called RAIL — to define expected output schemas, validators, and corrective actions. When the model's output fails validation, the framework automatically re-prompts with specific feedback about what went wrong — a technique called "re-asking." This works especially well when you need agent outputs to conform to strict data contracts like API response formats or database schemas.
If you're building a custom guardrail stack, a modular middleware architecture gives you the most flexibility. Each guardrail lives as an independent middleware function that takes the agent's input or output, runs its check, and either passes the data through, modifies it, or raises an exception. This lets you compose, reorder, and independently test each guardrail. It also makes it easy to A/B test different configurations in production to find the sweet spot between safety and user experience.
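The middleware composition described above can be sketched in a few lines. The two example stages (a length cap and a keyword block) are hypothetical placeholders for real guardrails like the prompt shield and PII filter from earlier sections.

```python
from typing import Callable

# A guardrail middleware takes text and returns (possibly modified) text,
# or raises GuardrailViolation to block the request entirely.
Middleware = Callable[[str], str]


class GuardrailViolation(Exception):
    """Raised by a middleware stage to block the pipeline."""


def compose(middlewares: list[Middleware]) -> Middleware:
    """Chain independent guardrails into a single pipeline, applied in order."""
    def pipeline(text: str) -> str:
        for mw in middlewares:
            text = mw(text)  # Each stage passes through, modifies, or raises
        return text
    return pipeline


# Two hypothetical stages: a length cap and a simple keyword block
def cap_length(text: str) -> str:
    return text[:2000]


def block_secrets(text: str) -> str:
    if "BEGIN PRIVATE KEY" in text:
        raise GuardrailViolation("output contains key material")
    return text


output_filter = compose([cap_length, block_secrets])
print(output_filter("All good here."))  # All good here.
```

Because each stage is an independent function, you can unit-test guardrails in isolation, reorder them, or swap configurations behind a feature flag for the A/B testing mentioned above.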
Tip
Start with the simplest guardrail stack that covers your known risks, then layer on more based on production data. Over-engineering before deployment usually leads to overly restrictive systems that frustrate users. Under-engineering leads to incidents. Iterative refinement based on real traffic gives you the best balance.
Key Takeaways
- You can't deploy autonomous agents in production without guardrails. Without them, failures aren't just likely — they're inevitable, and potentially catastrophic.
- Input validation needs to combine pattern matching, encoding detection, and ML classifiers to defend against prompt injection and adversarial content.
- Output filtering should cover toxicity, PII leakage, hallucination detection, and format validation. Relying on system prompt instructions alone won't cut it.
- Action permission systems should enforce scope restrictions, rate limits, and tiered approval levels. High-risk actions always need human sign-off.
- Circuit breakers from distributed systems engineering give you runtime safety by automatically halting agents that start behaving abnormally.
- Frameworks like NeMo Guardrails and Guardrails AI speed up implementation, but production systems often benefit most from a modular middleware approach that lets you compose custom configurations.
- Iterating on guardrails based on real production traffic beats trying to predict every failure mode upfront.