Vercel DeepSec: Can AI Finally Audit Its Own Code?

There's a specific kind of irony baked into the current AI coding moment. The same tools that let a solo developer ship a full-stack app in a weekend are also the tools generating the security holes that will eventually embarrass that developer, or their users. Speed and sloppiness have always been correlated in software development, but AI has turbocharged both at once.

The incidents aren't hypothetical anymore. In recent months, AI coding agents have deleted entire projects and wiped production databases while developers were working on something completely unrelated. In one particularly awkward episode, Apple's internal Claude.md file was leaked. This is real damage, happening in real codebases, at an accelerating pace.

Vercel's answer to this is DeepSec, a security harness released this week that wraps around existing AI coding agents (specifically Claude Code and OpenAI's Codex) and attempts to impose structure on what is otherwise a fairly chaotic security review process. Whether it actually solves the problem is a more nuanced question than the launch suggests.

What DeepSec Actually Does (And Why the Architecture Matters)

The core complaint about using Claude or any general-purpose AI agent for security reviews is well-captured in the AI LABS walkthrough: "If you ask Claude code or any agent for a security review, it will start by directly scanning the code base and then produce a full review report that not only takes a lot of time, but it also consumes a lot of tokens, and the review might still miss things."

That's three problems bundled together: slow, expensive, and unreliable. DeepSec addresses all three through a staged pipeline that's worth understanding, because the architecture is where the interesting engineering decisions live.

Step one is pure regex—no AI involved, no token burn. DeepSec scans every file in your repository using pattern matching to identify code that's likely to contain security-sensitive areas: authentication logic, database queries, API endpoints, input handling. This is fast, cheap, and deterministic. It narrows a potentially enormous codebase down to the files that actually warrant AI scrutiny.
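To make the prefilter idea concrete, here is a minimal sketch of what a regex-only first pass could look like. The pattern set, the function names, and the file extensions are all assumptions made for illustration; DeepSec's actual rules are not described in the walkthrough.

```python
import re
from pathlib import Path

# Hypothetical patterns, one per sensitive area named above.
# These are illustrative only and are not DeepSec's real rule set.
SENSITIVE_PATTERNS = {
    "auth": re.compile(r"\b(password|login|session|jwt|token)\b", re.IGNORECASE),
    "database": re.compile(
        r"\b(SELECT|INSERT|UPDATE|DELETE)\b.+\b(FROM|INTO|SET)\b", re.IGNORECASE
    ),
    "api_endpoint": re.compile(
        r"@app\.(get|post|put|delete)\(|router\.(get|post|put|delete)\("
    ),
    "input_handling": re.compile(
        r"\b(request\.(args|form|json)|req\.(body|query|params))\b"
    ),
}


def classify_text(text: str) -> set[str]:
    """Return the names of every sensitive area whose pattern matches the text."""
    return {name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)}


def prefilter(root: str, suffixes: tuple[str, ...] = (".py", ".js", ".ts")) -> dict[str, set[str]]:
    """Walk a repository and keep only the files that match at least one pattern."""
    hits: dict[str, set[str]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            tags = classify_text(path.read_text(errors="ignore"))
            if tags:
                hits[str(path)] = tags
    return hits
```

A pass like this is deliberately over-inclusive: a false match only costs one extra file in the AI stage, while a missed file never gets reviewed at all.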

Step two is where the expensive models come in. DeepSec deploys Anthropic's Claude Opus 4.7 at max effort and OpenAI's GPT 5.5 at "x-high reasoning"—the tool's own description, not a typo. These are not lightweight models, and they're running in parallel on batches of roughly five files at a time. The parallelization is the key architectural move: instead of feeding an agent your entire codebase sequentially and hoping it maintains context, DeepSec breaks the problem into discrete, manageable chunks. Each chunk gets a fresh prompt assembled from project metadata, framework context, and the information stored in the info.md file that DeepSec generates during initialization.
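The batching-and-fan-out shape described here can be sketched in a few lines. This is an illustration under stated assumptions: `analyze` is a hypothetical stand-in for whatever call actually reaches the model, the prompt layout is invented, and the worker count is arbitrary. Only the batch size of five comes from the walkthrough.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def make_batches(files: list[str], size: int = 5) -> list[list[str]]:
    """Split the flagged files into consecutive chunks of at most `size`."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def build_prompt(batch: list[str], project_info: str) -> str:
    """Assemble a fresh prompt per batch from shared project context (hypothetical layout)."""
    listing = "\n".join(f"- {name}" for name in batch)
    return (
        f"Project context:\n{project_info}\n\n"
        f"Review these files for security issues:\n{listing}"
    )


def run_batches(
    files: list[str],
    project_info: str,
    analyze: Callable[[str], list[dict]],  # hypothetical model call: prompt in, findings out
    workers: int = 4,
) -> list[dict]:
    """Send every batch prompt to the analyzer in parallel and flatten the findings."""
    prompts = [build_prompt(batch, project_info) for batch in make_batches(files)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = list(pool.map(analyze, prompts))  # map preserves batch order
    return [finding for findings in per_batch for finding in findings]
```

Because each prompt is built from scratch, no batch depends on context accumulated while reviewing another, which is the property the article credits for the pipeline's reliability.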

Once the batch analysis is complete, findings are merged, deduplicated, normalized, and optionally revalidated—a second-pass check specifically designed to flag false positives before anything reaches the export stage.
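A rough sketch of the merge, normalize, and deduplicate steps follows. The field names, the severity aliases, and the rule of keeping the highest severity when two models report the same issue are all assumptions for illustration, not DeepSec's documented behavior. The optional revalidation pass is left out.

```python
from dataclasses import dataclass

# Hypothetical mapping from the labels a model might emit to one canonical scale.
SEVERITY_ALIASES = {
    "crit": "critical", "critical": "critical",
    "high": "high",
    "med": "medium", "medium": "medium", "moderate": "medium",
    "low": "low", "info": "low",
}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    rule: str
    severity: str


def normalize(raw: dict) -> Finding:
    """Bring one raw finding into a canonical shape so duplicates compare equal."""
    label = str(raw.get("severity", "")).strip().lower()
    return Finding(
        file=raw["file"].replace("\\", "/").removeprefix("./"),
        line=int(raw["line"]),
        rule=raw["rule"].strip().lower().replace(" ", "-"),
        severity=SEVERITY_ALIASES.get(label, "medium"),  # unknown labels default to medium
    )


def merge_findings(batches: list[list[dict]]) -> list[Finding]:
    """Merge all batches, drop duplicates, and sort with the most severe first."""
    best: dict[tuple[str, int, str], Finding] = {}
    for batch in batches:
        for raw in batch:
            finding = normalize(raw)
            key = (finding.file, finding.line, finding.rule)
            current = best.get(key)
            if current is None or SEVERITY_RANK[finding.severity] > SEVERITY_RANK[current.severity]:
                best[key] = finding
    return sorted(best.values(), key=lambda f: (-SEVERITY_RANK[f.severity], f.file, f.line))
```

Normalizing before deduplicating matters here: two models describing the same flaw as "SQL Injection" and "sql-injection" would otherwise survive as separate findings.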

The output is structured enough to flow directly into a ticket system: each finding gets its own file, organized by severity, listing the exact source lines, the commit that introduced the issue, the developer who committed it, and recommended remediation steps.
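As a sketch of that export shape, the snippet below writes one file per finding into a folder named after its severity. The directory layout, file naming, and dictionary keys are hypothetical; the walkthrough only says what each finding contains, not how DeepSec lays it out on disk.

```python
from pathlib import Path


def finding_path(out_dir: str, finding: dict, index: int) -> Path:
    """Place each finding in a per-severity folder (hypothetical layout)."""
    return Path(out_dir) / finding["severity"] / f"finding-{index:03d}.md"


def render_finding(finding: dict) -> str:
    """Render the fields the article lists: lines, commit, author, and remediation."""
    return "\n".join([
        f"Title: {finding['title']}",
        f"Severity: {finding['severity']}",
        f"Location: {finding['file']}:{finding['start_line']}-{finding['end_line']}",
        f"Introduced in commit: {finding['commit']}",
        f"Committed by: {finding['author']}",
        "",
        "Recommended fix:",
        finding["remediation"],
        "",
    ])


def export_findings(findings: list[dict], out_dir: str) -> list[Path]:
    """Write every finding to its own file and return the paths written."""
    written = []
    for index, finding in enumerate(findings, start=1):
        path = finding_path(out_dir, finding, index)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(render_finding(finding), encoding="utf-8")
        written.append(path)
    return written
```

One file per finding is what makes the ticket-system handoff cheap: each file can become one ticket without any further parsing.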

The Test Results: Useful, With Caveats

The AI LABS team ran DeepSec against two codebases. The first was an OWASP practice application containing ten deliberately embedded vulnerabilities—essentially a target range for security tools. DeepSec surfaced three findings, which sounds bad until you understand why: the info.md file the team generated explicitly listed the ten known vulnerabilities, so DeepSec treated them as documented and focused its attention elsewhere. It was, in a sense, doing exactly what it was configured to do. A tool that ignores already-known issues to focus on unknown ones is arguably more useful than one that flags everything—including the things you already know about.

The second test, on a different app with a cleaner info.md, produced nine well-scoped findings with full severity ratings, source attribution, and fix recommendations. Alongside DeepSec, the team ran a parallel Claude review on the same codebase, which initially returned 39 issues. When asked to stay within scope, Claude narrowed that to 13. That scoped list included a few issues that DeepSec's nine did not.

What did DeepSec miss? Primarily runtime issues—CORS misconfigurations, logical patterns, architectural decisions. "It focuses only on issues that the code directly contains and that can be resolved directly from the functions themselves," the AI LABS breakdown explains. "It does not identify issues that might arise when the app actually runs."

That's a real limitation, and it maps to a well-known divide in security tooling: static analysis versus dynamic analysis. DeepSec is firmly in the static camp. It reads code; it doesn't run it. CORS problems, race conditions, authentication logic flaws that only manifest under specific runtime conditions—these are exactly the categories where static analysis has always struggled, regardless of whether the static analyzer is a regex engine, a formal verification tool, or a large language model.

The False Positive Question

The reported false positive rate of 10–20% deserves some context. In traditional static analysis security testing (SAST), false positive rates of 50–80% are common enough that many development teams simply disable the tools because the noise drowns out the signal. So 10–20% is genuinely good by that benchmark.
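A quick calculation shows why the gap matters in practice. The sketch below assumes "false positive rate" means the share of reported findings that turn out to be false, which is how the figure is usually quoted for SAST tools; the function name is made up for illustration.

```python
def false_alarms_per_real_issues(real_issues: int, fp_rate: float) -> float:
    """How many false reports accompany a given number of real ones.

    Assumes fp_rate is the fraction of all reported findings that are false,
    so real_issues make up the remaining (1 - fp_rate) share of the report.
    """
    if not 0 <= fp_rate < 1:
        raise ValueError("fp_rate must be in [0, 1)")
    total_reported = real_issues / (1 - fp_rate)
    return total_reported - real_issues


# Compare the ranges quoted above for every ten real issues found.
for rate in (0.10, 0.20, 0.50, 0.80):
    alarms = false_alarms_per_real_issues(10, rate)
    print(f"{rate:.0%} false positive rate: about {alarms:.1f} false alarms per 10 real issues")
```

Under that assumption, a 10–20% rate means roughly one to two and a half false alarms for every ten real issues, while 50–80% means ten to forty. That is the difference between a report a team reads and one it learns to ignore.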

But it's also worth noting where that figure comes from: testing by teams who are, by definition, motivated to present favorable results. Independent, adversarial evaluation of AI security tools is still thin. The revalidation step is designed to push that number down further, but it's optional—which means some users will skip it, and their effective false positive rate will be higher.

What This Doesn't Solve

DeepSec addresses a specific failure mode: AI agents generating code with static, pattern-level security flaws that go unreviewed because nobody thought to run a security scan before shipping. That's a real gap, and closing it matters.

It doesn't address the broader category of problems that produced the alarming incidents cited at the top—agents deleting databases, wiping projects, leaking internal files. Those aren't code vulnerability problems. They're agent behavior problems: insufficient sandboxing, overly permissive access, lack of confirmation steps before destructive actions. A security scanner that reviews code before it ships can't protect you from an agent that goes off-script during execution.

There's also the cost question. DeepSec explicitly reaches for the most powerful available models and runs them in parallel. The AI LABS team used their existing Claude Code subscription to avoid direct API charges, but at scale—on a large enterprise codebase, run on every pull request—the economics get complicated quickly. The tool's designers made a deliberate choice to prioritize thoroughness over cost-efficiency, which is a reasonable value judgment. It's just one worth knowing before you automate this into your CI pipeline.
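A back-of-the-envelope model makes it easier to see how the bill scales. Everything numeric in the usage example is a placeholder, not a measured token count or a real price; only the batch size of five and the use of two models come from the walkthrough.

```python
import math


def monthly_scan_cost(
    flagged_files: int,
    scans_per_month: int,
    tokens_per_batch: int,            # placeholder: measure this on your own codebase
    price_per_million_tokens: float,  # placeholder: check your provider's current pricing
    files_per_batch: int = 5,
    models: int = 2,
) -> float:
    """Estimate monthly spend when every scan sends each batch to every model."""
    batches = math.ceil(flagged_files / files_per_batch)
    calls_per_scan = batches * models
    tokens_per_scan = calls_per_scan * tokens_per_batch
    cost_per_scan = tokens_per_scan / 1_000_000 * price_per_million_tokens
    return cost_per_scan * scans_per_month


# Placeholder inputs only: 200 flagged files, scanned on 100 pull requests a month.
estimate = monthly_scan_cost(
    flagged_files=200,
    scans_per_month=100,
    tokens_per_batch=50_000,
    price_per_million_tokens=10.0,
)
print(f"Estimated monthly cost with placeholder inputs: {estimate:,.2f}")
```

The structure matters more than the output: cost grows linearly with flagged files, with the number of models, and with scan frequency, so running on every pull request multiplies whatever a single scan costs.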

The false positive rate, the static-only scope, and the token costs are all knowable tradeoffs. The more interesting open question is whether structured AI security tooling can actually keep pace with AI code generation in the wild—not in controlled test environments with OWASP practice apps, but in the sprawling, undocumented, half-refactored codebases that represent most of what developers actually ship.

That gap between what AI security tools can analyze and what AI coding tools can produce is still closing. Whether it closes fast enough is a different question entirely.

Marcus Chen-Ramirez is a Senior Technology Correspondent at Buzzrag. He covered software infrastructure before becoming a journalist and still reads commit histories for fun.