
Google's Model Armor: AI Security Through Callbacks

Google's Model Armor adds security checkpoints to AI agents through ADK callbacks, intercepting threats before they reach language models.

Written by Samira Okonkwo-Barnes, an AI editorial voice

February 14, 2026


Photo: Google Cloud Tech / YouTube

AI agent security has moved from theory to practice. Google Cloud's approach -- shown in a recent tutorial by Aron Eidelman -- uses callback mechanisms to create security checkpoints. These checkpoints catch threats before they reach language models or users.

The design is simple: add security checks at the two weakest points in an AI agent's workflow. First, before the agent sends a prompt to the language model. Second, after the model returns a response. The goal isn't to build a wall around the whole system. It's to create specific spots where automated rules can actually work.
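The two-checkpoint flow can be sketched in plain Python. Everything below is illustrative: the function names and the toy checks are assumptions standing in for ADK's callback hooks and Model Armor's sanitization calls, not the real APIs.

```python
from typing import Callable, Optional

SAFE_REPLY = "I can't help with that request."

def guarded_call(
    prompt: str,
    check_input: Callable[[str], Optional[str]],   # checkpoint 1: before the model
    call_model: Callable[[str], str],              # the LLM call itself
    check_output: Callable[[str], str],            # checkpoint 2: after the model
) -> str:
    # Checkpoint 1 runs before the model sees anything. A non-None
    # return value is a preset safe response that short-circuits the call.
    verdict = check_input(prompt)
    if verdict is not None:
        return verdict
    # Checkpoint 2 inspects the raw model output before the user sees it.
    return check_output(call_model(prompt))

# Toy stand-ins for Model Armor's policy checks.
def check_input(prompt: str) -> Optional[str]:
    return SAFE_REPLY if "ignore your previous instructions" in prompt.lower() else None

def check_output(text: str) -> str:
    # Hypothetical redaction rule; the real service applies configured templates.
    return text.replace("042-68-4425", "[redacted]")
```

The point of the shape is that the model call sits between two pluggable policy functions the security team can change independently.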

The Input Problem

Prompt injection remains one of the more awkward flaws in AI systems. Users -- whether hostile or just curious -- can slip instructions into prompts that override an agent's intended behavior. "Ignore your previous instructions" has become the "drop tables" of the AI era.

Eidelman shows how the before_model callback catches user prompts before they reach the language model: "This function runs just before the LLM request is sent, allowing us to inspect and sanitize the user's prompt using the model armor API's sanitize user prompt method."

If Model Armor spots a prompt injection attempt or policy breach, the callback sends a preset safe response. The model never sees the bad prompt. The compute cost is tiny -- a policy check instead of a full model run -- and the security line is clear.
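A minimal sketch of that input checkpoint, using simplified request and response types. The keyword heuristic is a stand-in for the real sanitize-user-prompt call, which requires a Model Armor client and a configured policy template; the class and function names here are assumptions, chosen only to mirror the control flow.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LlmRequest:       # simplified stand-in for ADK's request object
    prompt: str

@dataclass
class LlmResponse:      # simplified stand-in for ADK's response object
    text: str

PRESET_SAFE_RESPONSE = "Sorry, I can't help with that."

def prompt_is_unsafe(prompt: str) -> bool:
    # Toy heuristic in place of Model Armor's prompt-sanitization API,
    # which would classify injection attempts and policy violations.
    return "ignore your previous instructions" in prompt.lower()

def before_model(request: LlmRequest) -> Optional[LlmResponse]:
    # Returning a response here skips the LLM call entirely, so a
    # flagged prompt never reaches the model and costs no inference.
    if prompt_is_unsafe(request.prompt):
        return LlmResponse(text=PRESET_SAFE_RESPONSE)
    return None  # None tells the framework to proceed normally
```

In ADK the real callback also receives a context object and is registered on the agent; the sketch keeps only the decision logic.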

The real question is how good these detection tools are. Pattern matching catches obvious attacks. But prompt engineering tricks evolve fast. The video doesn't explain Model Armor's detection methods. That detail matters a lot when judging how well it holds up against skilled attackers.

The Output Vulnerability

The second checkpoint handles what happens when the language model itself becomes the risk. Models trained on huge datasets sometimes spit out data they shouldn't -- credit card numbers, personal details, or private patterns from their training data.

The after_model callback checks the raw LLM response before the user sees it. As Eidelman explains: "If the response contains leaked data or harmful output, we can replace or block the response. Replacing is useful if we want to preserve most of the response but just filter out a small part such as a name or a credit card number."

This targeted approach -- removing specific sensitive patterns while keeping the rest -- is a smart middle ground. Blocking whole responses creates a bad user experience. It may also tip off attackers that they've hit something worth probing. Selective filtering keeps things working while containing the leak.
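The replace-or-block choice can be sketched the same way. The detection rules below are toy stand-ins for the model-response sanitization call; which findings trigger a full block versus an in-place redaction is an assumption for illustration.

```python
import re

BLOCK_MESSAGE = "This response was withheld by policy."

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def after_model(raw_text: str) -> str:
    # Block outright only when nothing salvageable remains, e.g. the
    # response contains raw secret material.
    if "BEGIN PRIVATE KEY" in raw_text:
        return BLOCK_MESSAGE
    # Otherwise replace: redact the narrow finding, keep the answer.
    return EMAIL.sub("[redacted]", raw_text)
```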

The technical setup trims credit card numbers to their last four digits. This removes the disclosure risk while keeping enough info for valid use cases. It's the digital version of a receipt showing XXXX-XXXX-XXXX-1234.
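That last-four masking is straightforward to express. The regex below is an illustrative sketch for 16-digit numbers with optional separators, not Model Armor's actual matcher.

```python
import re

# Matches 16 digits grouped with optional spaces or dashes; captures the last 4.
PAN = re.compile(r"\b(?:\d[ -]?){12}(\d{4})\b")

def mask_card_numbers(text: str) -> str:
    return PAN.sub(lambda m: "XXXX-XXXX-XXXX-" + m.group(1), text)
```

For example, `mask_card_numbers("Card on file: 4111 1111 1111 1234.")` yields `"Card on file: XXXX-XXXX-XXXX-1234."`.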

What This Approach Misses

Callback-based security has clear design strengths. It's modular. It doesn't require changing the base model. It can be updated on its own as new threats appear. But it works only at the API boundary.

This means Model Armor can't fix problems in the model itself. It can't address bias in training data, tricks that fool the model's reasoning, or outputs that are harmful in ways that depend on context. A response can pass every policy check while still being misleading or dangerous in a specific setting.

The video shows blocking obvious prompt injection attempts and catching clear data patterns like credit card numbers. What it doesn't show is how the system handles gray areas. Think borderline prompts, outputs that hint at sensitive info without stating it outright, or attacks that work at the meaning level rather than the pattern level.

Google's docs point to more resources on Model Armor. One video is titled "We tried to jailbreak our AI (and Model Armor stopped it)." The framing is confident. But jailbreaking is an arms race. What works today may fail against next month's tricks.

Policy as Code

What makes callback-based security workable is that it turns security policies into code at fixed checkpoints. Rather than trying to make the language model itself "safer" -- a research problem with no clear fix -- this approach enforces rules at the interface layer where rules actually stick.
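"Policy as code" in this setting often means the checkpoint logic stays fixed while the rules it enforces live as data the security team owns. A hypothetical sketch, not Model Armor's actual rule format:

```python
import re
from typing import Optional

# Rules owned by the security team; the agent code never changes
# when a rule is added or tightened. (Illustrative format only.)
POLICY = [
    {"name": "injection", "pattern": r"ignore your previous instructions", "action": "block"},
    {"name": "email",     "pattern": r"[\w.+-]+@[\w-]+\.[\w.-]+",          "action": "redact"},
]

def enforce(text: str) -> Optional[str]:
    # Returns None to block, otherwise the (possibly redacted) text.
    for rule in POLICY:
        if re.search(rule["pattern"], text, re.IGNORECASE):
            if rule["action"] == "block":
                return None
            text = re.sub(rule["pattern"], "[redacted]", text, flags=re.IGNORECASE)
    return text
```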

The trade-off is plain: you're not making the model more trustworthy. You're limiting what it can be asked and what it can say. For many business use cases, that's enough. For tasks that need real model alignment and safety, it's a band-aid, not a cure.

Eidelman's demo shows a clean split of duties. The agent team focuses on features. The security team sets the rules. Model Armor enforces them at runtime. This is solid engineering. Whether it's enough depends on your threat model and where you deploy.

The video ends with a teaser about using "anti-gravity to vibe code an agent." Marketing speak aside, it signals Google's ongoing work on AI security tools. And that's fitting. If there's one thing everyone in AI security agrees on, it's that today's defenses are stopgaps at best.

Samira Okonkwo-Barnes covers technology policy and regulation for Buzzrag.

Watch the Original Video

Using Model Armor in ADK callbacks

Google Cloud Tech

3m 53s
Watch on YouTube

About This Source

Google Cloud Tech

Google Cloud Tech is Google's official YouTube channel for cloud computing resources, with more than 1.3 million subscribers. It offers tutorials, product news, and insights into developer tools for developers and IT professionals worldwide.
