
OpenAI's Websocket Shift Could Cut AI Bandwidth by 90%

OpenAI's move from REST to websockets promises 90%+ bandwidth reduction for AI agents. Here's why this seemingly simple change is actually revolutionary.

Written by AI. Yuki Okonkwo

March 5, 2026


Photo: Theo - t3.gg / YouTube

Here's something wild: every time an AI agent makes a tool call—checking a file, listing a directory, whatever—it resends your entire conversation history to OpenAI's servers. The whole thing. Every single time.

You ask an agent to improve your app's SEO. It decides to scan your codebase. First tool call: send full history. It finds something interesting in the /src directory. Second tool call: send full history again. It makes an edit. Third tool call: yep, full history. One user message might spawn hundreds of these calls, each one hauling around increasingly massive context payloads.
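To see how fast this adds up, here's a toy simulation of that loop: it just measures the JSON payload a stateless REST call would ship at each step. (The message shapes are illustrative, not OpenAI's actual request format.)

```python
import json

def payload_bytes(messages):
    """Size of the JSON body a stateless REST call would ship."""
    return len(json.dumps({"input": messages}).encode())

history = [{"role": "user", "content": "improve my app's SEO"}]
sent = []
for step in range(3):  # three tool calls, as in the example above
    sent.append(payload_bytes(history))            # full history, every time
    history.append({"role": "assistant", "content": f"tool call {step}"})
    history.append({"role": "tool", "content": "fake result " * 50})

# Each successive call ships strictly more bytes than the last,
# even though only the last tool result is new information.
assert sent[0] < sent[1] < sent[2]
```

Run this for hundreds of tool calls instead of three and the growth is quadratic: every new turn gets re-sent on every subsequent call.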

This is how OpenAI's API has worked since tool calling became a thing. And until this week, we've all just... lived with it? 🤷‍♀️

OpenAI just changed that. They're moving their Responses API from traditional REST calls to websockets, and according to developer Theo Browne's breakdown, the results are kind of absurd: 90%+ bandwidth reduction and 20-40% speed improvements for agentic workflows.

The Groundhog Day Problem

To understand why this matters, you need to understand how stateless these AI models actually are. When an agent makes a tool call and gets results back, the model doesn't just... remember things. It's not sitting there thinking about your problem while waiting for the file system to respond.

The generation stops. The AI is effectively dead until you feed it the tool call results. And when you do, you have to remind it of everything that happened before, because it has no memory of the conversation.

"It's almost like a Groundhog Day type thing where every time the model wakes up, there's nothing left in its brain," Browne explains in his video. "It's stateless. It has nothing going on. So all of the state has to be shipped back every single time."

This creates a genuinely bonkers situation: an agent might ingest 100,000 tokens (roughly two megabytes of text) and respond with eight words. This happens constantly in production systems.

Why Caching Doesn't Fix This

I know what you're thinking—isn't that what prompt caching is for? And yeah, caching helps with compute costs and speeds up the time to first token. But here's the thing that trips people up: caching doesn't change how much data you send.

The cache key is a hash of your conversation history. You still have to send the entire history so OpenAI can hash it and determine what's cached. The cache just means the GPU doesn't have to reprocess those tokens from scratch. You're still pushing all that data over the wire every single time.

Browne emphasizes this point because it's a common misconception: "The cache does not change how much data you send. The cache just changes how long it takes to process the data."
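A toy illustration of the idea Browne describes (real prompt caches work on hashed prefixes server-side; this is not OpenAI's implementation): the cache key is derived from the full history the client just transmitted, so a cache hit saves GPU work, not bytes on the wire.

```python
import hashlib
import json

def cache_key(messages):
    # The server derives the key from the FULL history you just sent --
    # so caching saves compute, not bandwidth.
    blob = json.dumps(messages, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

history = [{"role": "user", "content": "hello"}]
k1 = cache_key(history)
k2 = cache_key(history)
assert k1 == k2                       # same history -> server-side cache hit
history.append({"role": "tool", "content": "file listing"})
assert cache_key(history) != k1       # any new turn changes the key
```

Either way, the client had to serialize and send the whole history just so the server could compute that hash.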

The Routing Problem

Here's where it gets architectural. OpenAI doesn't route your requests directly to GPUs. There's an orchestration layer—API servers that check permissions, look for cached values, find available GPUs, and route requests accordingly.

These API servers are stateless by design. Your first request might hit API box #1. Your second might hit box #4. This means every single request needs to include full context, recheck auth, verify cache status—the works. Even for follow-ups that happen milliseconds apart.
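A toy model of why statelessness forces this (round-robin here stands in for whatever load balancing OpenAI actually uses, which isn't public): consecutive requests land on different boxes, so nothing one box learned helps the next.

```python
import itertools

# Toy load balancer: stateless API boxes served round-robin.
boxes = ["api-1", "api-2", "api-3", "api-4"]
pick = itertools.cycle(boxes)

handled = [next(pick) for _ in range(4)]
# The follow-up request hits a different machine, so the auth check,
# cache lookup, and context upload all happen again from scratch.
assert handled[0] != handled[1]
```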

Trying to maintain state across stateless servers at this scale? "Not viable. Not viable at all," as Browne puts it. You'd need an external data store that every API box queries for every request. The latency alone would kill you, and you'd never know how long to keep sessions alive.

Enter Websockets

Websockets solve this with an almost embarrassingly simple guarantee: you hit the same box for the duration of your session.

That persistent connection means the API server can keep track of what you've done. It already checked your auth. It knows what's cached. It has your conversation history in memory. When a tool call completes, you just send the new data—the tool result—not the entire multi-megabyte history.

"The websocket is less a protocol here, more a guarantee," Browne notes. It's not that websockets are magically faster. It's that they guarantee stateful sessions, which eliminates all that redundant work.

OpenAI's own numbers: "Websockets keep a persistent connection to the Responses API, allowing you to send only new inputs instead of sending round trips for the entire context on every turn. By maintaining in-memory state across interactions, it avoids repeated work and speeds up agentic runs with 20+ tool calls by 20 to 40%."
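A toy comparison makes the bandwidth math concrete. This sketch models a websocket session as a server-side object that accumulates history in memory, so the client only ever ships the delta; the numbers and message shapes are made up, but the asymmetry is the point.

```python
import json

class StatefulSession:
    """Toy model of a websocket session: the server keeps the
    conversation in memory, so the client ships only the delta."""
    def __init__(self):
        self.server_history = []   # lives on ONE box for the session
        self.bytes_sent = 0

    def send(self, new_messages):
        self.bytes_sent += len(json.dumps(new_messages).encode())
        self.server_history.extend(new_messages)

rest_bytes = 0
ws = StatefulSession()
history = []
for step in range(20):                     # a 20-tool-call agentic run
    turn = [{"role": "tool", "content": f"result {step} " * 20}]
    history.extend(turn)
    rest_bytes += len(json.dumps(history).encode())  # REST: full history
    ws.send(turn)                                    # websocket: delta only

# In this toy run the delta approach ships well under a fifth of the bytes.
assert ws.bytes_sent < rest_bytes * 0.2
```

The longer the agentic run, the wider the gap, which is why OpenAI scopes its 20-40% speedup claim to runs with 20+ tool calls.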

The Bigger Picture

This change mostly benefits agentic workflows—systems that make lots of tool calls per user message. For basic chat apps where users send one message and wait for a response, the optimization is less dramatic. Reloading context once per human message is fine. Reloading it hundreds of times for one message? Not fine.

What's interesting is how long it took to get here. Tool calling got "stapled on" to existing REST APIs, and we all just accepted the inefficiency because that's how it worked. OpenAI open-sourced the specification (called Open Responses), which means Anthropic, Google, and others will likely adopt this pattern.

Which is the kind of thing that makes you realize how genuinely early we are in this space. We're still figuring out basics like "maybe don't send the same data 400 times per conversation." There's so much low-hanging fruit in the entire stack—networking, storage, API design, all of it.

Deep infrastructure work isn't dead because of AI. If anything, it matters more than ever. We get to rethink how these systems should work from first principles, which is the fun part.

Yuki Okonkwo is Buzzrag's AI & Machine Learning Correspondent.

Watch the Original Video

I can't believe nobody's done this before...

Theo - t3.gg

18m 55s
Watch on YouTube

About This Source

Theo - t3.gg

Theo - t3.gg is a burgeoning YouTube channel that has quickly amassed a following of 492,000 subscribers since launching in October 2025. Headed by Theo, a passionate software developer and AI enthusiast, the channel explores the realms of artificial intelligence, TypeScript, and innovative software development methodologies. Notable for initiatives like T3 Chat and the T3 Stack, Theo has carved out a niche as a knowledgeable and engaging figure in the tech community.

