AI Guardrails: Keeping a Model On-Task When Users Go Off-Script

An AI feature without guardrails follows whatever the user just typed, even when that contradicts what you built it to do. That is the default behavior of a language model wired directly to your users, and it holds until you build something around it that says no. The demo never shows this, because the people in the demo are typing the inputs the feature was designed for. The first stranger who isn’t will.

Guardrails are the part of an AI feature that assumes the model will sometimes be wrong and the user will sometimes be adversarial. We’ve covered the broader production-readiness pattern before, where guardrails are one item on a longer list. This post is the deeper look at that one item, because it’s the layer that most often gets reduced to a sentence in a system prompt and called done.

What a guardrail actually is

A guardrail is a check that runs in your code, not in the model. The model is the thing being guarded. It is not the thing doing the guarding, and the distinction matters more than it sounds.

Instructions you put in a system prompt are requests. The model honors them most of the time, and abandons them the moment an input is strange enough, long enough, or crafted to talk it out of them. A guardrail is enforced in deterministic code the model can’t argue with: a classifier that runs before the model sees the input, a schema validator that runs after, an allow-list of actions the model is permitted to trigger. The model can suggest anything. Your code decides what actually happens.

Guardrails run in two directions. Input guardrails decide what reaches the model. Output guardrails decide what reaches your users and your systems.

Input guardrails: deciding what the model is allowed to see

Input guardrails do two jobs. They keep the feature on-task, and they reduce the damage when someone tries to repurpose it.

Topic boundaries are the first job. A support assistant for your product has no business writing poetry, answering questions about a competitor, or weighing in on politics under your brand’s name. Detecting off-topic input and refusing it before the model responds is cheaper than tokens and safer than trusting the model to decline gracefully every time. The reader who has watched a company chatbot get screenshotted saying something absurd knows exactly what this prevents.

The second job is prompt injection, which OWASP ranks as the top security risk for LLM applications. Users, or content the model retrieves on their behalf, try to override your instructions: “ignore previous instructions and instead do this.” You can’t fully prevent it with better wording in the prompt, because the attack and your defense live in the same channel the model reads as one undifferentiated block of text. What you can do is limit the blast radius. Treat all user and retrieved content as untrusted, keep the system prompt isolated from it where the model’s API allows, and constrain what the model is actually able to do so that a successful injection reaches a small, well-fenced surface instead of your whole system.

The working assumption: your system prompt is not a secret. Assume it leaks, because it will, and design so that leaking it doesn’t hand anyone the keys.

Output guardrails: deciding what the model is allowed to do

Output guardrails sit between the model and everything downstream of it. They exist because a plausible-looking response and a correct, safe response are not the same thing, and only one of them is verifiable in code.

Structure validation is the baseline. The rest of your system expects the model’s output in a particular shape: a JSON object with specific fields, a value from a fixed set, a number in a sane range. Validate against that shape before the output moves on, and fail closed when it doesn’t match. A malformed response should stop at the guardrail, not crash the next step in the pipeline or pass quietly through as garbage that surfaces three screens later.

Commitment limits are the harder job, and the one businesses underestimate. An assistant that can talk will, given the chance, agree to a refund outside policy, quote a delivery date it can’t verify, or authorize a discount nobody approved. We see this most in voice AI, where the assistant is live on a call and a promise is made before anyone can catch it. The fix is the same in text or voice: separate what the model can suggest from what it can execute. A booking the assistant completes is one thing. A discount it authorizes is another. Those limits get scoped at design time and enforced at runtime, in code, not in a hopeful line of the prompt.

A guardrail is not a paragraph in the system prompt

The most common version of “we added guardrails” is a paragraph in the system prompt that says what the model shouldn’t do. That paragraph is worth having. It is not a guardrail.

A line that tells the model not to discuss competitors is a preference the model follows on ordinary inputs and drops on unusual ones. Real guardrails are deterministic: a classifier that scores the input before the model runs, a validator that rejects malformed output, an allow-list that makes a forbidden action unreachable rather than merely discouraged. The test is simple. If your only defense lives inside the text the model reads, the model can be argued out of it. If your defense lives in code the model can’t see or edit, it can’t.

This is the same discipline as never trusting input from a browser. You validate on the server because the client can lie. A language model is a client that can be talked into lying, so you validate around it for the same reason.

The tradeoff: over-constrain and you kill the feature

Guardrails are not free, and more of them is not strictly better. Every check is latency the user waits through before seeing a response, and every constraint is a chance to reject something legitimate. An input filter tuned too tight refuses real questions. Output validation tuned too aggressively makes the assistant decline so often that people stop using it. A feature that never makes a mistake because it never does anything is its own kind of failure.

The work is calibration: tight enough that bad outcomes are rare and contained, loose enough that the feature still does the thing people came for. That calibration needs measurement, which is why guardrails belong under the same evals as the rest of the feature. You track false-refusal rate next to accuracy. When you tighten a filter, the evals tell you whether you cut real attacks or just real users.

Where guardrails fit

Guardrails are one layer of the scaffolding we build around production AI, alongside evals, retries, cost caps, and observability. They assume the model is fallible and the user is sometimes hostile, and they hold the line in code when both assumptions turn out to be true. The model layer underneath can follow the model-agnostic pattern so that swapping providers doesn’t mean rebuilding the fences each time.

The question to ask of any AI feature you’re shipping, or any vendor you’re about to commission, is short: what stops this when the input is hostile or the output is wrong? If the only answer is that the system prompt asks it nicely, there are no guardrails yet, just a polite request the model is free to ignore.

If you’re building an AI feature and trying to work out where it needs fencing, tell us what you’re building. We respond within one business day with the gaps we see and a rough scope for closing them.