AI Agents With 5M-Token Memory Raise Privacy Questions

Most people who use an AI assistant today are thinking about whether the answer is accurate. They're not thinking about how much of their conversation history the system is holding, where that data lives while it's being processed, or what happens to it if someone gets in. That's not a criticism — it's just where we are. But research coming out of Together AI suggests the scale of that held context is about to get dramatically larger, and it seems worth pausing to think about what that means before it becomes normalized.

Max Ryabinin, identified in the presentation as VP of research and development at Together AI (title independently unverified), recently presented a research project called "Road to 5 Million Sequence Length" at the AI Engineer conference. The core engineering achievement is real and genuinely interesting: Together AI developed a set of techniques that allow AI models to be trained on sequences of up to five million tokens — five million chunks of text, code, or other data — without the whole thing collapsing under its own memory requirements. The ML infrastructure community will care about the how. But the why is where it gets interesting for everyone else.

Ryabinin is direct about what this capability is for. Two use cases anchor the work: AI agents and video generation. On agents specifically: "with the explosion in popularity of agents, you can see a lot of different applications where you might want to put as many tokens as you want in your context, and you want the model to leverage that context effectively." For video: maintaining "temporal consistency" across frames requires holding a lot of history in active memory.

Those are legitimate applications. They're also, if you follow the thread a little further, a description of AI systems that maintain extensive, detailed records of user behavior and interaction — not in a database somewhere, but in the model's active working context during inference.

What five million tokens actually holds

To ground the scale: a token is roughly three-quarters of a word in English text. Five million tokens is approximately 3.75 million words, or somewhere in the range of 30 to 40 standard-length novels depending on genre. For an AI agent handling your calendar, your emails, your browser history, and your ongoing project notes, that context window isn't a curiosity; it's closer to a complete picture of your working life over a meaningful stretch of time.
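For readers who want to check the arithmetic, here is a minimal sketch of the conversion. The three-quarters-of-a-word ratio comes from the rule of thumb above; the novel lengths (100,000 and 125,000 words) and the function names are assumptions chosen for illustration, not figures from the talk.

```python
# Rule-of-thumb ratio from the text; the novel lengths below are assumptions.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Estimate the English word count for a given number of tokens."""
    return tokens * words_per_token


def novels_equivalent(words: float, words_per_novel: int) -> float:
    """Express a word count as a number of novels of an assumed length."""
    return words / words_per_novel


words = tokens_to_words(5_000_000)
print(f"{words:,.0f} words")
for words_per_novel in (100_000, 125_000):  # assumed novel lengths
    count = novels_equivalent(words, words_per_novel)
    print(f"{count:.1f} novels at {words_per_novel:,} words each")
```

Shorter assumed novel lengths push the count toward the top of the range, longer ones toward the bottom.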

The reason this matters from a security and privacy standpoint isn't that Together AI is doing anything nefarious. The research is published, the techniques are being made available, and the intent is to advance open-source model training. The reason it matters is that the infrastructure for this kind of deep context retention is being democratized. Once the engineering problem is solved and the methods are documented, the question of how that capability gets used — and by whom, with what safeguards — is no longer a research question. It's a deployment question.

The architecture, briefly

For readers who want the technical picture: Ryabinin's talk walks through a layered stack of memory optimization techniques, each addressing a different bottleneck. The starting point is a standard LLaMA 3B model on a single node of eight H100 GPUs, where, by his account, even placing the model parameters exhausts GPU memory before training begins. Fully sharded data parallelism splits the parameters across all eight GPUs, so each holds only a fraction of the model. That is necessary but not sufficient: attention activations still overflow.
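The sharding idea can be sketched in a few lines of plain Python. This is a toy illustration, not Together AI's code or any framework's API: a flat list stands in for the model's parameters, and the names shard_parameters and all_gather are hypothetical helpers invented for this example.

```python
def shard_parameters(flat_params, world_size):
    """Split a flat parameter list into equal shards, zero-padding the tail."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padding = shard_len * world_size - len(flat_params)
    padded = list(flat_params) + [0.0] * padding
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards, original_len):
    """Reassemble the full parameter list from every shard, dropping padding."""
    full = [value for shard in shards for value in shard]
    return full[:original_len]


params = [float(i) for i in range(20)]         # stand-in for model weights
shards = shard_parameters(params, world_size=8)  # one shard per GPU
print([len(s) for s in shards])
restored = all_gather(shards, len(params))
```

Each simulated GPU keeps only its own shard at rest; the full list is reassembled only when a computation needs it.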

DeepSpeed Ulysses, a context parallelism technique originally from Microsoft, shards the sequence across GPUs and then, for the attention step, redistributes the data with all-to-all communication so that each GPU computes attention over the full sequence for only a subset of the attention heads. Ryabinin reports this yields roughly an 8x reduction in activation memory. Activation checkpointing (discarding intermediate computations during the forward pass and recomputing them as needed during backpropagation) provides another significant reduction; Ryabinin reports a further factor of eight in his implementation, though the exact figure depends on architecture and configuration. CPU offloading moves transformer block inputs to system memory when they're not actively needed, then prefetches them before backpropagation. Ryabinin credits this technique to Unsloth, noting they believe it was first implemented there, though CPU offloading for deep learning training has a longer history than any single framework.
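To make the Ulysses data movement concrete, the following toy simulation shows the redistribution in plain Python. It is a sketch under simplifying assumptions: nested lists stand in for tensors, each (token, head) tuple stands in for one head's data for one token, and the function names are hypothetical rather than DeepSpeed's API.

```python
def make_sequence_shards(seq_len, num_heads, world_size):
    """Rank r starts with a contiguous slice of tokens, each carrying all heads."""
    assert seq_len % world_size == 0
    per_rank = seq_len // world_size
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * per_rank, (r + 1) * per_rank)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_heads(shards, num_heads):
    """Simulate the all-to-all: each rank ends with every token, few heads."""
    world_size = len(shards)
    assert num_heads % world_size == 0
    heads_per_rank = num_heads // world_size
    result = []
    for r in range(world_size):
        lo, hi = r * heads_per_rank, (r + 1) * heads_per_rank
        # Collect this rank's head slice from every token on every rank.
        result.append([row[lo:hi] for shard in shards for row in shard])
    return result


shards = make_sequence_shards(seq_len=8, num_heads=4, world_size=2)
by_heads = all_to_all_seq_to_heads(shards, num_heads=4)
print(len(by_heads[0]), "tokens x", len(by_heads[0][0]), "heads on rank 0")
```

After the exchange, each simulated rank sees the whole sequence but holds only its share of the heads, which is why per-GPU attention memory scales down with the number of GPUs.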

Even stacking all of that gets you to three million tokens. Five million required something new.

Together AI's contribution — which they're calling Untied Ulysses — goes deeper into the context parallelism step. The key observation: computing attention for even a single group of heads is enough to saturate a GPU's capacity within one iteration. If that's true, you don't need to allocate a buffer for all your head groups simultaneously. You can allocate a smaller buffer, process one chunk of heads, store the partial result, then reuse the same buffer for the next chunk. The memory footprint shrinks substantially. Ryabinin reports that at both 8 billion and 32 billion parameter scales, Untied Ulysses matches the most memory-efficient existing baselines while extending maximum sequence length beyond what prior Ulysses implementations could reach — by around 25% according to the paper, though readers who want to verify that number should go to the source directly.
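The buffer-reuse idea described above can be illustrated with a toy comparison. This is not Together AI's implementation: the attend function is a hypothetical stand-in for real attention, Python lists stand in for GPU buffers, and "slots" is an invented unit for counting how much buffer space is allocated at once.

```python
def attend(values):
    """Toy stand-in for attention over one head group (hypothetical)."""
    return sum(values)


def all_groups_at_once(head_groups):
    """Allocate a buffer for every head group before computing anything."""
    buffers = [list(group) for group in head_groups]
    peak_slots = sum(len(b) for b in buffers)
    return [attend(b) for b in buffers], peak_slots


def one_group_at_a_time(head_groups):
    """Reuse a single buffer sized for the largest group."""
    buffer = [0.0] * max(len(group) for group in head_groups)
    partial_results = []
    for group in head_groups:
        n = len(group)
        buffer[:n] = group  # overwrite the same buffer in place
        partial_results.append(attend(buffer[:n]))  # store the partial result
    return partial_results, len(buffer)


head_groups = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
full_out, full_peak = all_groups_at_once(head_groups)
chunk_out, chunk_peak = one_group_at_a_time(head_groups)
print(full_peak, "slots vs", chunk_peak, "slots")
```

Both paths produce the same results; only the peak buffer allocation differs, which is the trade the text describes.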

The part that doesn't get discussed in ML talks

Here's where the infrastructure story and the privacy story converge. Ryabinin makes the case that understanding where memory goes matters even if you're not training at million-token scales: "maybe you might be able to reinvest it in some other ways and speed up your training overall." That's a practitioner's point. The broader version of that point is that every optimization to context capacity increases what an AI system can retain about you during a session — and potentially across sessions, depending on how the deployment is architected.

An AI agent with five million tokens of context isn't just more capable. It's a more detailed record of your interactions, held in active GPU memory, somewhere in an inference infrastructure you probably don't control. The attack surface for that is different from a database breach — context memory isn't persistent in the traditional sense — but the exposure window during active inference is real. Prompt injection attacks, for instance, become more consequential when the model has extensive user history to draw from. An attacker who can manipulate a long-context agent's reasoning has access to a much richer substrate than one operating on a short-window model.

None of this is Together AI's problem specifically, and nothing in Ryabinin's talk should be read as dismissing these concerns — they're simply outside the scope of an ML engineering talk. But the gap between "we solved the memory problem" and "here's what users should understand about systems built on this capability" is exactly the gap that tends to close too slowly in this industry.

The techniques Ryabinin describes are public, documented, and designed to be adopted. That's the point — Together AI is building infrastructure for the broader open-source AI ecosystem. Which means the question isn't really about Together AI's deployment practices. It's about what happens when every team building an AI agent product has access to five-million-token context windows and the question of what to do with all that user history is left as an exercise for the developer.

Historically, "an exercise for the developer" has not been a reassuring answer to privacy questions.

Rachel "Rach" Kovacs is Buzzrag's cybersecurity and privacy correspondent.