How OpenGov Deployed AI Agents for Local

Most AI agent talks describe what's theoretically possible. Gabe De Mesa's presentation at the AI Engineer conference describes something narrower and considerably more interesting: what happened when a real engineering team actually shipped agents to thousands of state and local governments, ran them in production, hit walls, and rebuilt their way around those walls.

OpenGov makes ERP software for government—budgeting, procurement, asset management, permitting. Not glamorous, but consequential. When the company launched OG Assist, a chat-style AI agent embedded across its product suite, the stakes weren't abstract. A broken response about utility billing rate codes or procurement workflows lands differently when the user is a city clerk, not a startup employee with a high tolerance for bugs.

That context shapes everything De Mesa describes. The engineering decisions aren't just architectural preferences—they're responses to the specific pressure of deploying AI into regulated, bureaucratic, high-stakes environments where trust is slow to build and fast to lose.

The Framework Bet

The most revealing early decision De Mesa describes is the team's move away from LangGraph—one of the most popular frameworks for building AI agent pipelines—in favor of building their own agent loop on top of Effect, an open-source TypeScript library.

This is worth pausing on. LangGraph has a significant community and ecosystem. Abandoning it means giving up a lot of pre-built scaffolding in exchange for control. De Mesa is candid about the tradeoff: "We decided to move over to our own kind of Effect Native Agent Loop to have full agency over this Agent Loop such that if we have complex use cases or features that we need to build, we could kind of get in, we had full control of the Agent Loop."

Effect is not an AI-specific tool. It's a functional programming library for TypeScript that bundles schema validation, error handling, structured concurrency, and—critically for this use case—distributed tracing. The OpenGov team's insight was that these general-purpose software engineering primitives matter enormously in agentic systems, where you're stitching together model calls, tool executions, API integrations, and real-time UI updates in ways that can fail silently and expensively.

The tracing benefit is concrete. Effect automatically tags function calls with spans that feed into distributed traces—which means when an agent call buries a failure somewhere across three API hops, you can actually find it. "You can't scale what you can't see," De Mesa says, and it's one of those phrases that's obvious in retrospect but takes production incidents to really internalize.
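To make the idea tangible, here is a minimal sketch of span tracing in plain TypeScript. It is not Effect's API and not OpenGov's code; the `Tracer` class, the span names, and the simulated billing failure are all hypothetical, chosen only to show how nested spans let you locate the hop where an error began.

```typescript
// Plain-TypeScript illustration of span tracing. This is NOT Effect's API.
interface Span {
  name: string;
  parent: string | null;
  status: "ok" | "error";
  error?: string;
}

class Tracer {
  readonly spans: Span[] = [];
  private stack: string[] = [];

  // Run `fn` inside a named span, recording its parent and its outcome.
  withSpan<T>(name: string, fn: () => T): T {
    const span: Span = {
      name,
      parent: this.stack[this.stack.length - 1] ?? null,
      status: "ok",
    };
    this.spans.push(span);
    this.stack.push(name);
    try {
      return fn();
    } catch (e) {
      span.status = "error";
      span.error = e instanceof Error ? e.message : String(e);
      throw e; // re-throw so callers still see the failure
    } finally {
      this.stack.pop();
    }
  }

  // A failed span with no failed child is where the error originated.
  // Assumes span names are unique within one trace.
  originOfFailure(): Span | undefined {
    const failed = this.spans.filter((s) => s.status === "error");
    return failed.filter(
      (s) => !failed.some((child) => child.parent === s.name)
    )[0];
  }
}

// Hypothetical three-hop call: agent turn -> tool -> billing API.
const tracer = new Tracer();
try {
  tracer.withSpan("agent.turn", () =>
    tracer.withSpan("tool.lookupRateCode", () =>
      tracer.withSpan("api.billing", (): string => {
        throw new Error("503 from billing service");
      })
    )
  );
} catch {
  // The exception alone says little; the trace says where it started.
}
const failureOrigin = tracer.originOfFailure();
```

All three spans end up marked as failed, but only the innermost one has no failed child, which is what points at the billing hop. A production tracer would also record timestamps and trace IDs and export them to a tracing backend.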

Whether this decision generalizes is a genuine open question. OpenGov had specific reasons to go deep on a functional TypeScript stack: its existing codebase, team expertise, and the operational demands of a multi-tenant government SaaS product. Teams without those anchors might find LangGraph or similar frameworks entirely sufficient. The lesson isn't "use Effect." It's "understand what you're giving up when you use a framework that abstracts away control, and be honest about when that tradeoff stops working."

Google's Agent Protocol, Deployed

De Mesa also describes adopting Google's Agent2Agent (A2A) protocol, an open specification for how AI agents describe themselves and communicate with each other. OpenGov used it not primarily for agent-to-agent communication, but as a schema contract between its frontend and backend.

This is a clever and pragmatic application of a protocol that's still relatively young. By building their data models around the A2A spec—including the "agent card" that carries agent name, description, and capabilities—they got a shared language that both sides of their system could consume and produce without bespoke negotiation. "Having this kind of rigorous protocol, this rigorous spec really helped drive our development and drive alignment," De Mesa explains.
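A rough sketch of what "spec as contract" can look like in practice: one shared type, plus a validator that rejects anything off-contract at the boundary. The `AgentCard` shape below is a simplified, hypothetical subset (name, description, and a single capability flag); the published A2A spec defines more fields, and nothing here is taken from OpenGov's code.

```typescript
// Simplified, hypothetical subset of an A2A-style agent card.
// The published spec defines more fields than are modeled here.
interface AgentCard {
  name: string;
  description: string;
  capabilities: { streaming?: boolean };
}

// Validate untrusted JSON at the boundary so frontend and backend
// agree on exactly one shape.
function parseAgentCard(raw: unknown): AgentCard {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("agent card must be an object");
  }
  const obj = raw as Record<string, unknown>;
  const name = obj["name"];
  const description = obj["description"];
  const caps = obj["capabilities"];

  if (typeof name !== "string" || name.length === 0) {
    throw new Error("name must be a non-empty string");
  }
  if (typeof description !== "string") {
    throw new Error("description must be a string");
  }
  if (typeof caps !== "object" || caps === null || Array.isArray(caps)) {
    throw new Error("capabilities must be an object");
  }
  const streaming = (caps as Record<string, unknown>)["streaming"];
  if (streaming !== undefined && typeof streaming !== "boolean") {
    throw new Error("capabilities.streaming must be a boolean");
  }
  return {
    name,
    description,
    capabilities: streaming === undefined ? {} : { streaming },
  };
}

// Hypothetical payload, as a backend might serve it to the frontend.
const card = parseAgentCard(
  JSON.parse(
    '{"name":"OG Assist","description":"Chat-style assistant","capabilities":{"streaming":true}}'
  )
);
```

The payoff is that a malformed card fails loudly at one well-defined point instead of surfacing later as a confusing UI bug.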

It's also worth noting what this represents in a broader context. A2A is one of several emerging standards (alongside Anthropic's MCP and others) jockeying to become the connective tissue of multi-agent systems. OpenGov's adoption suggests at least one production engineering team found the spec practical enough to build around—not just to demo. Whether A2A ultimately wins out in a landscape where every major AI lab has its own protocol agenda remains genuinely unclear.

Safety as Architecture, Not Policy

Two of the more substantive engineering choices De Mesa discusses concern safety—and they're implemented at the system level, not the prompt level.

The first is human-in-the-loop interrupts. When OG Assist tries to execute a tool call that requires approval—any operation that mutates data, for example—the agent loop deterministically pauses and surfaces a UI asking a human to accept or reject the action. "Always always always making sure that humans are in the driver's seat," De Mesa says. The emphasis is deliberate and a little telling. This isn't a feature that emerged from a product roadmap. It sounds like something a team builds after thinking carefully about what happens when an agent does something irreversible to a government database.
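The pattern can be sketched in a few lines. What follows is an illustration under stated assumptions, not OpenGov's implementation: the `advance` function, the `mutates` flag, and the two toy tools are hypothetical. The point is that the pause is decided by code inspecting the tool, not by the model choosing to ask.

```typescript
interface ToolCall { tool: string; input: string }
interface Tool { mutates: boolean; run: (input: string) => string }
interface RunState { next: number; log: string[] }
type Decision = "accept" | "reject";
type StepResult = { status: "paused"; pending: ToolCall } | { status: "done" };

// Advance the plan until it finishes or reaches a mutating call that has
// no human decision yet. A decision applies only to the call at state.next.
function advance(
  state: RunState,
  tools: Record<string, Tool>,
  plan: ToolCall[],
  decision?: Decision
): StepResult {
  let pendingDecision = decision;
  while (state.next < plan.length) {
    const call = plan[state.next];
    if (call === undefined) break;
    const tool = tools[call.tool];
    if (tool === undefined) throw new Error(`unknown tool: ${call.tool}`);

    if (tool.mutates) {
      // Deterministic gate: no decision means stop and ask a human.
      if (pendingDecision === undefined) {
        return { status: "paused", pending: call };
      }
      const approved = pendingDecision === "accept";
      pendingDecision = undefined; // a decision is consumed by exactly one call
      if (!approved) {
        state.log.push(`rejected: ${call.tool}`);
        state.next += 1;
        continue;
      }
    }
    state.log.push(tool.run(call.input));
    state.next += 1;
  }
  return { status: "done" };
}

// Hypothetical record and tools: one read, one write.
const db = { label: "residential" };
const tools: Record<string, Tool> = {
  readLabel: { mutates: false, run: () => `label is ${db.label}` },
  writeLabel: {
    mutates: true,
    run: (input) => {
      db.label = input;
      return `label set to ${input}`;
    },
  },
};
const plan: ToolCall[] = [
  { tool: "readLabel", input: "" },
  { tool: "writeLabel", input: "commercial" },
];

const state: RunState = { next: 0, log: [] };
const first = advance(state, tools, plan); // runs the read, pauses at the write
const labelWhilePaused = db.label; // nothing has been mutated yet
const second = advance(state, tools, plan, "accept"); // human approved
```

In a real system the paused state would be persisted and the decision would arrive from the approval UI, possibly minutes later; here both calls happen back to back for brevity.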

The second is sandboxing. Whenever OG Assist executes code or creates files, it does so in an ephemeral isolated environment that spins up on demand and tears down when the task is complete. The agent can write and run code, generate PDFs, create files—all without touching production systems. This is standard practice in security-sensitive contexts, but it's notable that OpenGov treats it as a prerequisite for giving agents code execution capabilities, rather than an afterthought.
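The lifecycle De Mesa describes (spin up, do the work, tear down) maps onto a familiar acquire-use-release shape. The sketch below shows only that shape: the `Sandbox` here is an in-memory stand-in and provides no real isolation, and `withSandbox` is a hypothetical helper. An actual deployment would back it with a container, VM, or comparable mechanism.

```typescript
// In-memory stand-in for an isolated environment. Only the lifecycle is
// illustrated; this object provides NO real isolation.
interface Sandbox {
  id: string;
  files: Record<string, string>;
}

let nextSandboxId = 1;
let liveSandboxes = 0;

// Create a sandbox for one task and always tear it down afterwards.
function withSandbox<T>(task: (sandbox: Sandbox) => T): T {
  const sandbox: Sandbox = { id: `sandbox-${nextSandboxId}`, files: {} };
  nextSandboxId += 1;
  liveSandboxes += 1; // spin up on demand
  try {
    return task(sandbox);
  } finally {
    sandbox.files = {}; // discard everything the task wrote
    liveSandboxes -= 1; // tear down even if the task threw
  }
}

// Hypothetical task: write a file, return only a derived result.
const reportSize = withSandbox((sb) => {
  const body = "Q3 permit summary";
  sb.files["report.txt"] = body;
  return body.length;
});
```

The `finally` block is the important part: teardown runs whether the task succeeds or throws, so a failed agent run cannot leave an environment behind.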

Neither of these is a novel invention. Both reflect well-established software engineering principles applied to a new context. What's interesting is that De Mesa presents them not as compliance theater but as the actual architecture that allows the team to move fast. Trust, in this framing, is an engineering problem before it's a people problem.

The Memory Problem Nobody Has Solved Cleanly

Long-context handling is one of those AI production challenges that every team hits and nobody has a perfect answer for. De Mesa describes OpenGov's approach: rolling summarization.

Rather than stuffing the full conversation history into every model call, which hits token limits and degrades response quality as conversations lengthen, the team maintains a running summary that compresses older exchanges while preserving the most recent messages verbatim. When a user refers back to something discussed much earlier, the agent recalls it from the summary rather than from the full transcript.
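The mechanics are simple enough to sketch. In this hypothetical version, `remember` keeps the last `keepRecent` messages verbatim and folds anything older into a summary. The summarizer is a stub that just concatenates text; in a real system it would presumably be a model call, and the window size and message contents here are invented for illustration.

```typescript
interface ChatMessage { role: "user" | "assistant"; text: string }
interface ChatMemory { summary: string; recent: ChatMessage[] }
type Summarizer = (previousSummary: string, evicted: ChatMessage[]) => string;

// Add a message; anything that falls out of the recent window is folded
// into the running summary instead of being dropped.
function remember(
  memory: ChatMemory,
  message: ChatMessage,
  keepRecent: number,
  summarize: Summarizer
): ChatMemory {
  const all = [...memory.recent, message];
  const overflow = all.length - keepRecent;
  if (overflow <= 0) return { summary: memory.summary, recent: all };
  return {
    summary: summarize(memory.summary, all.slice(0, overflow)),
    recent: all.slice(overflow),
  };
}

// What would be sent to the model: summary first, then verbatim messages.
function buildContext(memory: ChatMemory): string {
  const lines = memory.recent.map((m) => `${m.role}: ${m.text}`);
  return memory.summary.length > 0
    ? [`summary so far: ${memory.summary}`, ...lines].join("\n")
    : lines.join("\n");
}

// Stand-in summarizer (an assumption): joins text instead of calling a model.
const stubSummarize: Summarizer = (prev, evicted) =>
  [prev, ...evicted.map((m) => m.text)].filter((s) => s.length > 0).join(" / ");

// Hypothetical four-turn conversation with a window of two messages.
const turns: ChatMessage[] = [
  { role: "user", text: "What is rate code R-12?" },
  { role: "assistant", text: "R-12 is the residential water rate." },
  { role: "user", text: "And for commercial?" },
  { role: "assistant", text: "Commercial uses C-3." },
];
let memory: ChatMemory = { summary: "", recent: [] };
for (const turn of turns) {
  memory = remember(memory, turn, 2, stubSummarize);
}
```

Note that each eviction feeds the previous summary back into the summarizer. With a real model in that slot, that is exactly where the "summaries of summaries" fidelity loss creeps in.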

By De Mesa's account, it has worked well for them. It's also a pragmatic approximation of something more elegant that the field hasn't fully figured out yet. Rolling summaries lose fidelity. Summaries of summaries lose more. The problem of giving agents genuinely reliable long-term memory across complex, branching conversations remains open, and De Mesa's candor about hitting "many hurdles" with legacy models and token limits is refreshing in a space that tends toward confident retrospective narration.

The Feedback Loop That Actually Matters

"Shipping is the start, not the finish."

That's the framing De Mesa puts on OG Assist's evaluation and feedback architecture, and it deserves attention because it runs counter to how AI products often get discussed. The launch is not the story. The iteration is.

OpenGov collects feedback through multiple channels—direct user outreach, thumbs-up/thumbs-down signals on individual agent responses, and automated evaluations that run in CI against real completions, checking whether prompts trigger the expected tool calls and produce accurate results. The human signal and the automated signal feed into each other. When a thumbs-down flags a bad response, the team can examine what the automated evals missed—or update the evals to catch that failure class in the future.
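A check of that kind might look roughly like this. The sketch assumes completions have already been recorded and are graded offline; the case IDs, tool names, and the `gradeCase` function are hypothetical, and substring matching stands in for whatever accuracy check a real pipeline would use.

```typescript
interface Completion { toolCalls: string[]; answer: string }
interface EvalCase {
  id: string;
  prompt: string;
  expectedTools: string[];
  mustInclude: string;
}
interface EvalResult { id: string; passed: boolean; failures: string[] }

// Grade one recorded completion: did it call the expected tools, and does
// the answer contain the required text (case-insensitive)?
function gradeCase(c: EvalCase, completion: Completion): EvalResult {
  const failures: string[] = [];
  for (const tool of c.expectedTools) {
    if (completion.toolCalls.indexOf(tool) === -1) {
      failures.push(`missing tool call: ${tool}`);
    }
  }
  const answer = completion.answer.toLowerCase();
  if (answer.indexOf(c.mustInclude.toLowerCase()) === -1) {
    failures.push(`answer does not mention: ${c.mustInclude}`);
  }
  return { id: c.id, passed: failures.length === 0, failures };
}

// Hypothetical cases; the second might have been added after a thumbs-down.
const cases: EvalCase[] = [
  {
    id: "rate-code-lookup",
    prompt: "What is rate code R-12?",
    expectedTools: ["getRateCode"],
    mustInclude: "residential",
  },
  {
    id: "po-status",
    prompt: "Has purchase order 1042 been approved?",
    expectedTools: ["getPurchaseOrder"],
    mustInclude: "approved",
  },
];

// Hypothetical recorded completions, keyed by case id.
const recorded: Record<string, Completion> = {
  "rate-code-lookup": {
    toolCalls: ["getRateCode"],
    answer: "R-12 is the Residential water rate.",
  },
  "po-status": {
    toolCalls: [],
    answer: "I could not find that purchase order.",
  },
};

const results = cases.map((c) =>
  gradeCase(c, recorded[c.id] ?? { toolCalls: [], answer: "" })
);
// A CI job would fail the build when this list is non-empty.
const failedIds = results.filter((r) => !r.passed).map((r) => r.id);
```

Turning a thumbs-down into a new entry in `cases` is what closes the loop: the failure a human caught once becomes something the pipeline catches every time after.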

This is less exciting than most things discussed in AI engineering talks. It's also probably the most important part of what De Mesa describes. The gap between "we shipped an agent" and "we have an agent that reliably does useful things over time" is filled almost entirely by feedback infrastructure and the willingness to act on it.

An Honest Question About the Customers

There's one dimension De Mesa doesn't dwell on, and it's worth naming: the end users of OG Assist are government employees navigating ERP software. That's a specific population with specific constraints—varying levels of technical fluency, workflows shaped by regulation, institutional cultures that are often skeptical of new tools, and very little tolerance for errors that affect public services.

How those users actually experience OG Assist—whether it saves them time, whether the human-in-the-loop approvals feel empowering or cumbersome, whether rolling summarization creates subtle errors that a busy city procurement officer wouldn't catch—none of that surfaces in a developer talk. That's expected; De Mesa is speaking to engineers, not policy analysts.

But it's a gap worth holding onto. The engineering architecture De Mesa describes is genuinely thoughtful. Whether it translates into better government for the people those governments serve is a question that requires a different kind of evidence than distributed traces and eval pass rates.

The agents are in production. The harder audit is still ahead.

Marcus Chen-Ramirez is a senior technology correspondent for Buzzrag covering AI, software development, and the intersection of technology and society.