
How OpenGov Deployed AI Agents for Local Government

OpenGov engineer Gabe De Mesa details how OG Assist brought AI agents to thousands of state and local governments—and what it actually took to make them work.

Marcus Chen-Ramirez

Written by AI. Marcus Chen-Ramirez

June 26, 2026 · 8 min read

Photo: AI. Wren Sugimoto

Most AI agent talks describe what's theoretically possible. Gabe De Mesa's presentation at the AI Engineer conference describes something narrower and considerably more interesting: what happened when a real engineering team actually shipped agents to thousands of state and local governments, ran them in production, hit walls, and rebuilt their way around those walls.

OpenGov makes ERP software for government—budgeting, procurement, asset management, permitting. Not glamorous, but consequential. When the company launched OG Assist, a chat-style AI agent embedded across its product suite, the stakes weren't abstract. A broken response about utility billing rate codes or procurement workflows lands differently when the user is a city clerk, not a startup employee with a high tolerance for bugs.

That context shapes everything De Mesa describes. The engineering decisions aren't just architectural preferences—they're responses to the specific pressure of deploying AI into regulated, bureaucratic, high-stakes environments where trust is slow to build and fast to lose.

The Framework Bet

The most revealing early decision De Mesa describes is the team's move away from LangGraph—one of the most popular frameworks for building AI agent pipelines—in favor of building their own agent loop on top of Effect, an open-source TypeScript library.

This is worth pausing on. LangGraph has a significant community and ecosystem. Abandoning it means giving up a lot of pre-built scaffolding in exchange for control. De Mesa is candid about the tradeoff: "We decided to move over to our own kind of Effect Native Agent Loop to have full regency over this Agent Loop such that if we have complex use cases or features that we need to build, we could kind of get in—we had full control of the Agent Loop."

Effect is not an AI-specific tool. It's a functional programming library for TypeScript that bundles schema validation, error handling, structured concurrency, and—critically for this use case—distributed tracing. The OpenGov team's insight was that these general-purpose software engineering primitives matter enormously in agentic systems, where you're stitching together model calls, tool executions, API integrations, and real-time UI updates in ways that can fail silently and expensively.

The tracing benefit is concrete. Effect automatically tags function calls with spans that feed into distributed traces—which means when an agent call buries a failure somewhere across three API hops, you can actually find it. "You can't scale what you can't see," De Mesa says, and it's one of those phrases that's obvious in retrospect but takes production incidents to really internalize.
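To make the span idea concrete, here is a minimal sketch in plain TypeScript. It does not use Effect's API; `withSpan`, `Span`, and the step names (`agent.turn`, `tool.lookupRateCode`, `http.billingApi`) are hypothetical, chosen only to show how nested spans let you locate a failure buried several hops down.

```typescript
// Hypothetical, simplified span recorder (NOT Effect's tracing API).
interface Span {
  name: string;
  parent: string | null;
  status: "ok" | "error";
  error?: string;
}

const spans: Span[] = [];
const stack: string[] = [];

// Run `fn` inside a named span, recording its parent span and its outcome.
function withSpan<T>(name: string, fn: () => T): T {
  const parent = stack[stack.length - 1] ?? null;
  stack.push(name);
  try {
    const result = fn();
    spans.push({ name, parent, status: "ok" });
    return result;
  } catch (e) {
    spans.push({ name, parent, status: "error", error: String(e) });
    throw e;
  } finally {
    stack.pop();
  }
}

// Illustrative agent turn: three nested hops, and the innermost one fails.
function runAgentTurn(): string {
  return withSpan("agent.turn", () =>
    withSpan("tool.lookupRateCode", () =>
      withSpan("http.billingApi", () => {
        throw new Error("503 from billing API");
      })
    )
  );
}

try {
  runAgentTurn();
} catch {
  // The caller swallows the failure, but the recorded spans still show its origin.
}

// Spans are recorded as they finish, so the first error span is the deepest failing hop.
const firstFailure = spans.find((s) => s.status === "error");
console.log(firstFailure?.name, "<-", firstFailure?.parent);
```

A production tracer would also record timings and export spans to a tracing backend; the sketch keeps only the parent/child structure that makes the failing hop findable.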

Whether this decision generalizes is a genuinely open question. OpenGov had specific reasons to go deep on a functional TypeScript stack: their existing codebase, team expertise, and the operational demands of a multi-tenant government SaaS product. Teams without those anchors might find LangGraph or similar frameworks perfectly sufficient. The lesson isn't "use Effect." It's "understand what you're giving up when you use a framework that abstracts away control, and be honest about when that tradeoff stops working."

Google's Agent Protocol, Deployed

De Mesa also describes adopting Google's Agent-to-Agent (A2A) protocol—an open specification for how AI agents describe themselves and communicate with each other. OpenGov used it not primarily for agent-to-agent communication, but as a schema contract between their frontend and backend.

This is a clever and pragmatic application of a protocol that's still relatively young. By building their data models around the A2A spec—including the "agent card" that carries agent name, description, and capabilities—they got a shared language that both sides of their system could consume and produce without bespoke negotiation. "Having this kind of rigorous protocol, this rigorous spec really helped drive our development and drive alignment," De Mesa explains.
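As a rough illustration of using an agent card as a frontend/backend contract, the sketch below defines a deliberately simplified card with only the three fields the talk mentions. The exact shape (for example, `capabilities.streaming` as a required boolean), the `parseAgentCard` helper, and the sample values are assumptions for illustration; the real A2A specification defines more fields, and this is not OpenGov's code.

```typescript
// Simplified, illustrative subset of an agent card (the real A2A spec has more fields).
interface AgentCard {
  name: string;
  description: string;
  capabilities: { streaming: boolean };
}

// Runtime guard so both sides of the wire reject malformed payloads the same way.
function parseAgentCard(input: unknown): AgentCard {
  if (typeof input !== "object" || input === null) {
    throw new Error("agent card must be an object");
  }
  const raw = input as Record<string, unknown>;
  const name = raw["name"];
  const description = raw["description"];
  const caps = raw["capabilities"];
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("agent card needs a non-empty name");
  }
  if (typeof description !== "string") {
    throw new Error("agent card needs a description");
  }
  if (typeof caps !== "object" || caps === null) {
    throw new Error("agent card needs a capabilities object");
  }
  const streaming = (caps as Record<string, unknown>)["streaming"];
  if (typeof streaming !== "boolean") {
    throw new Error("capabilities.streaming must be a boolean");
  }
  return { name, description, capabilities: { streaming } };
}

// Illustrative payload, as a backend might serve it and a frontend might consume it.
const payload = JSON.stringify({
  name: "OG Assist",
  description: "Chat-style agent embedded in the product suite",
  capabilities: { streaming: true },
});
const card = parseAgentCard(JSON.parse(payload));
console.log(card.name, card.capabilities.streaming);
```

The design point is that one type plus one validator gives both halves of the system the same definition of a well-formed agent, which is the kind of alignment the quote describes.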

It's also worth noting what this represents in a broader context. A2A is one of several emerging standards (alongside Anthropic's MCP and others) jockeying to become the connective tissue of multi-agent systems. OpenGov's adoption suggests at least one production engineering team found the spec practical enough to build around—not just to demo. Whether A2A ultimately wins out in a landscape where every major AI lab has its own protocol agenda remains genuinely unclear.

Safety as Architecture, Not Policy

Two of the more substantive engineering choices De Mesa discusses concern safety—and they're implemented at the system level, not the prompt level.

The first is human-in-the-loop interrupts. When OG Assist tries to execute a tool call that requires approval—any operation that mutates data, for example—the agent loop deterministically pauses and surfaces a UI asking a human to accept or reject the action. "Always always always making sure that humans are in the driver's seat," De Mesa says. The emphasis is deliberate and a little telling. This isn't a feature that emerged from a product roadmap. It sounds like something a team builds after thinking carefully about what happens when an agent does something irreversible to a government database.
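A minimal sketch of what a deterministic approval gate can look like follows. The tool names, the `mutatesData` flag, and `executeToolCall` are hypothetical; the point is only that the pause is enforced in code rather than requested in a prompt.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface ToolDef {
  mutatesData: boolean; // hypothetical flag marking tools that need approval
  run: (args: Record<string, unknown>) => string;
}

type StepResult =
  | { kind: "completed"; output: string }
  | { kind: "awaiting_approval"; pending: ToolCall }
  | { kind: "rejected"; pending: ToolCall };

// Illustrative tool registry; a real one would call product APIs.
const tools: Record<string, ToolDef> = {
  getInvoice: { mutatesData: false, run: (a) => `invoice ${String(a["id"])}` },
  updateInvoice: { mutatesData: true, run: (a) => `updated ${String(a["id"])}` },
};

// Execute one tool call. Mutating tools never run without an explicit decision.
function executeToolCall(call: ToolCall, decision?: "approve" | "reject"): StepResult {
  const def = tools[call.tool];
  if (def === undefined) throw new Error(`unknown tool: ${call.tool}`);
  if (def.mutatesData && decision === undefined) {
    return { kind: "awaiting_approval", pending: call }; // UI shows accept/reject
  }
  if (def.mutatesData && decision === "reject") {
    return { kind: "rejected", pending: call };
  }
  return { kind: "completed", output: def.run(call.args) };
}

const call: ToolCall = { tool: "updateInvoice", args: { id: 42 } };
const first = executeToolCall(call); // pauses: no decision yet
const second = executeToolCall(call, "approve"); // runs after a human approves
console.log(first.kind, second.kind);
```

Because the gate is an ordinary branch in the loop, no model output can skip it; the model can only propose the call.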

The second is sandboxing. Whenever OG Assist executes code or creates files, it does so in an ephemeral isolated environment that spins up on demand and tears down when the task is complete. The agent can write and run code, generate PDFs, create files—all without touching production systems. This is standard practice in security-sensitive contexts, but it's notable that OpenGov treats it as a prerequisite for giving agents code execution capabilities, rather than an afterthought.
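The lifecycle described here (spin up, run the task, always tear down) can be sketched as below. This shows only the lifecycle, not real isolation: `Workspace` is an in-memory stand-in invented for illustration, where an actual sandbox would be a container, VM, or similar.

```typescript
// In-memory stand-in for an isolated workspace (assumption; not a real sandbox).
class Workspace {
  readonly files = new Map<string, string>();
  disposed = false;

  write(path: string, content: string): void {
    if (this.disposed) throw new Error("workspace already torn down");
    this.files.set(path, content);
  }

  dispose(): void {
    this.files.clear();
    this.disposed = true;
  }
}

let activeWorkspaces = 0;

// Create a fresh workspace per task and tear it down even if the task throws.
function withEphemeralWorkspace<T>(task: (ws: Workspace) => T): T {
  const ws = new Workspace();
  activeWorkspaces += 1;
  try {
    return task(ws);
  } finally {
    ws.dispose();
    activeWorkspaces -= 1;
  }
}

const fileCount = withEphemeralWorkspace((ws) => {
  ws.write("report.txt", "generated by the agent");
  return ws.files.size;
});
console.log(fileCount, activeWorkspaces);
```

The `finally` block is the important part: teardown does not depend on the task succeeding, so a failed run leaves nothing behind.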

Neither of these is a novel invention. Both reflect well-established software engineering principles applied to a new context. What's interesting is that De Mesa presents them not as compliance theater but as the actual architecture that allows the team to move fast. Trust, in this framing, is an engineering problem before it's a people problem.

The Memory Problem Nobody Has Solved Cleanly

Long-context handling is one of those AI production challenges that every team hits and nobody has a perfect answer for. De Mesa describes OpenGov's approach: rolling summarization.

Rather than stuffing the full conversation history into each model call, which hits token limits and degrades response quality as conversations lengthen, the team maintains a running summary that compresses older exchanges while preserving the most recent messages verbatim. When a user refers back to something discussed much earlier, the agent recalls it from the summary rather than from the full transcript.
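A rolling summary of this kind can be sketched in a few lines. Everything here is illustrative: `summarize` is a placeholder that truncates text where a real system would call a model, and `keepRecent`, the `Memory` shape, and the sample conversation are assumptions rather than details from the talk.

```typescript
interface Message {
  role: "user" | "assistant";
  text: string;
}

interface Memory {
  summary: string; // compressed older exchanges
  recent: Message[]; // most recent messages, kept verbatim
}

// Placeholder summarizer (assumption): a real system would call a model here.
function summarize(previous: string, evicted: Message[]): string {
  const added = evicted.map((m) => `${m.role}: ${m.text.slice(0, 40)}`).join(" | ");
  return previous === "" ? added : `${previous} | ${added}`;
}

// Append a message; anything beyond the `keepRecent` window is folded into the summary.
function appendMessage(memory: Memory, message: Message, keepRecent: number): Memory {
  const all = [...memory.recent, message];
  const overflow = Math.max(0, all.length - keepRecent);
  if (overflow === 0) return { summary: memory.summary, recent: all };
  return {
    summary: summarize(memory.summary, all.slice(0, overflow)),
    recent: all.slice(overflow),
  };
}

// Build the context for the next model call: summary first, then verbatim messages.
function buildPrompt(memory: Memory): string {
  const history = memory.recent.map((m) => `${m.role}: ${m.text}`).join("\n");
  return memory.summary === ""
    ? history
    : `Summary of earlier conversation: ${memory.summary}\n${history}`;
}

let memory: Memory = { summary: "", recent: [] };
const turns: Message[] = [
  { role: "user", text: "What is rate code R-12?" },
  { role: "assistant", text: "R-12 is a residential water rate." },
  { role: "user", text: "And who approved it?" },
  { role: "assistant", text: "I could not find an approval record." },
];
for (const turn of turns) memory = appendMessage(memory, turn, 2);
console.log(buildPrompt(memory));
```

With a window of two, the first two turns end up only in the summary, so whatever the summarizer drops is gone for good.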

De Mesa says it has worked well for them. It's also a pragmatic approximation of something more elegant that the field hasn't fully figured out yet. Rolling summaries lose fidelity. Summaries of summaries lose more. The problem of giving agents genuinely reliable long-term memory across complex, branching conversations remains open, and De Mesa's candor about hitting "many hurdles" with legacy models and token limits is refreshing in a space that tends toward confident retrospective narration.

The Feedback Loop That Actually Matters

"Shipping is the start, not the finish."

That's the framing De Mesa puts on OG Assist's evaluation and feedback architecture, and it deserves attention because it runs counter to how AI products often get discussed. The launch is not the story. The iteration is.

OpenGov collects feedback through multiple channels—direct user outreach, thumbs-up/thumbs-down signals on individual agent responses, and automated evaluations that run in CI against real completions, checking whether prompts trigger the expected tool calls and produce accurate results. The human signal and the automated signal feed into each other. When a thumbs-down flags a bad response, the team can examine what the automated evals missed—or update the evals to catch that failure class in the future.
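The tool-call half of such an eval suite might look roughly like this. The `stubAgent`, the tool names, and the test prompts are invented for illustration; in CI the agent function would wrap a call to the real system, and accuracy checks on the final answer would sit alongside it.

```typescript
interface EvalCase {
  prompt: string;
  expectedTool: string;
}

type Agent = (prompt: string) => { tool: string };

// Stand-in agent (assumption): routes on a keyword instead of calling a model.
const stubAgent: Agent = (prompt) =>
  /budget/i.test(prompt) ? { tool: "getBudget" } : { tool: "searchDocs" };

// Run every case and collect the ones where the wrong tool was chosen.
function runEvals(agent: Agent, cases: EvalCase[]): { passed: number; failures: EvalCase[] } {
  const failures = cases.filter((c) => agent(c.prompt).tool !== c.expectedTool);
  return { passed: cases.length - failures.length, failures };
}

const cases: EvalCase[] = [
  { prompt: "Show the parks budget for 2025", expectedTool: "getBudget" },
  { prompt: "Where is the permit fee schedule?", expectedTool: "searchDocs" },
  // Regression case of the kind a thumbs-down report might prompt (hypothetical).
  { prompt: "How much did we allocate to road repair?", expectedTool: "getBudget" },
];

const report = runEvals(stubAgent, cases);
console.log(`${report.passed}/${cases.length} passed`);
for (const failure of report.failures) {
  console.log(`wrong tool for: ${failure.prompt}`);
}
// A CI job would exit non-zero at this point when report.failures is non-empty.
```

The third case shows the loop between the two signals: a phrasing the original evals never covered becomes a permanent test once a user flags it.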

This is less exciting than most things discussed in AI engineering talks. It's also probably the most important part of what De Mesa describes. The gap between "we shipped an agent" and "we have an agent that reliably does useful things over time" is filled almost entirely by feedback infrastructure and the willingness to act on it.

An Honest Question About the Customers

There's one dimension De Mesa doesn't dwell on, and it's worth naming: the end users of OG Assist are government employees navigating ERP software. That's a specific population with specific constraints—varying levels of technical fluency, workflows shaped by regulation, institutional cultures that are often skeptical of new tools, and very little tolerance for errors that affect public services.

How those users actually experience OG Assist—whether it saves them time, whether the human-in-the-loop approvals feel empowering or cumbersome, whether rolling summarization creates subtle errors that a busy city procurement officer wouldn't catch—none of that surfaces in a developer talk. That's expected; De Mesa is speaking to engineers, not policy analysts.

But it's a gap worth holding onto. The engineering architecture De Mesa describes is genuinely thoughtful. Whether it translates into better government for the people those governments serve is a question that requires a different kind of evidence than distributed traces and eval pass rates.

The agents are in production. The harder audit is still ahead.


Marcus Chen-Ramirez is a senior technology correspondent for Buzzrag covering AI, software development, and the intersection of technology and society.
