
GLM-5's Self-Distillation Trick Solves AI's Memory Problem

GLM-5 uses self-distillation to prevent catastrophic forgetting during training. A deep dive into the engineering that makes 700B-parameter models actually work.

Written by AI. Rachel "Rach" Kovacs

February 21, 2026


Photo: Hugging Face / YouTube

Training a 700-billion-parameter AI model is expensive. Training it multiple times because it forgot how to do basic math while you were teaching it to use tools? That's the kind of expensive that gets people fired.

The team behind GLM-5 ran into this exact problem—and their solution involves teaching the model using itself as the teacher. It's weird. It works. And it reveals something important about how we're going to train the next generation of AI systems.

The Forgetting Problem

Here's the setup: You've trained a massive language model. Now you want to make it better at reasoning. So you do reinforcement learning focused on math and logic. Great—it gets smarter. Then you want to teach it to use tools and APIs, so you do another round of RL focused on agentic behavior. Also great.

Except now your model has partially forgotten how to do the reasoning tasks you just spent weeks training it on. This is called catastrophic forgetting, and it's been a thorn in the side of AI researchers since before anyone was throwing billions of dollars at these problems.

The Hugging Face team discussing GLM-5's technical paper kept circling back to this issue. As one researcher put it: "If you start from the SFT model and you just do like sequential stages of RL training then the general problem you run into is like catastrophic forgetting. So essentially if I do like agentic training on top of a reasoning model perhaps I start to lose some of my base reasoning capabilities in a trade-off with like the agentic ones."

The standard approach is to just... accept it. You train different specialized models, or you carefully balance your training data, or you throw more compute at the problem until it mostly works. GLM-5's approach is different.

Teaching Yourself Using Yourself

GLM-5 uses what they call "on-policy cross-stage distillation." Strip away the jargon and here's what's happening: at each training stage, the previous version of the model becomes the teacher for the current one.

You train the base model, then do supervised fine-tuning. That SFT model becomes the teacher when you do reasoning-focused RL—you're not just optimizing for good reasoning, you're also trying to stay close to what the SFT model knew. Then that reasoning-RL model becomes the teacher for the agentic-RL stage. And so on.
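The idea can be sketched as a KL-regularized objective: optimize the RL reward while penalizing drift away from the previous checkpoint's output distribution. This is a toy illustration, not GLM-5's actual loss; the `beta` weight and the three-token vocabulary are made up for the example:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_stage_loss(rl_loss, student_probs, teacher_probs, beta=0.1):
    """RL objective plus a KL penalty that keeps the student close to the
    previous-stage checkpoint (the 'teacher'). beta is a hypothetical knob."""
    return rl_loss + beta * kl_divergence(student_probs, teacher_probs)

# Toy next-token distributions over a 3-token vocabulary.
student = [0.7, 0.2, 0.1]   # current policy after some RL updates
teacher = [0.6, 0.3, 0.1]   # frozen previous-stage checkpoint
loss = cross_stage_loss(rl_loss=1.0, student_probs=student, teacher_probs=teacher)
```

The penalty only grows as the student drifts from the teacher, which is exactly the pressure that resists forgetting what the earlier stage learned.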

The team discusses a twist: rather than doing this sequentially, they might be using all the expert models as teachers simultaneously in the final training stage. "You train essentially all these expert models right like you have a math expert... and then I suppose you then take all those models as your teachers and then you do the final stage," one researcher suggests.

Why would you do it this way instead of sequentially? The answer reveals the constraints of working at this scale: "This is the 700 billion parameter model. I guess... the other advantage is also if you do it their way, you can kind of parallelize the effort, right? You don't have to wait for one checkpoint to be trained before you can do the next [stage]."

When your training runs cost millions of dollars and take weeks, being able to parallelize matters. A lot.
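If the final stage really does distill from every expert at once, the single-teacher penalty generalizes to a weighted sum of KL terms. This is a speculative sketch of that variant (the researchers were themselves guessing); the uniform weights are an assumption:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_penalty(student_probs, teacher_dists, weights=None):
    """Weighted sum of KL terms, one per expert teacher (math, agentic, ...).
    Because the experts are trained in parallel, the final stage can distill
    from all of them at once instead of chaining stages sequentially."""
    if weights is None:
        weights = [1.0 / len(teacher_dists)] * len(teacher_dists)
    return sum(w * kl(student_probs, t) for w, t in zip(weights, teacher_dists))

student = [0.5, 0.3, 0.2]
teachers = [
    [0.6, 0.25, 0.15],   # e.g. a math expert's next-token distribution
    [0.4, 0.35, 0.25],   # e.g. an agentic expert's distribution
]
penalty = multi_teacher_penalty(student, teachers)
```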

The Infrastructure Nobody Talks About

The self-distillation trick is clever, but the researchers spent more time discussing the infrastructure required to make any of this work. This is where the real engineering lives.

GLM-5 uses DeepSeek Sparse Attention, which sounds like a buzzword until you realize what it actually does: instead of every token attending to every previous token (which scales quadratically and becomes computationally absurd at long contexts), an indexer selects which past tokens actually matter. The idea isn't new, but it's finicky to train and requires a "very careful training strategy," according to the discussion.
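The selection step can be sketched in a few lines: score every past token cheaply, keep only the top-k, and run softmax attention over the survivors. This is a minimal single-head illustration, not the actual DSA implementation; dimensions and `k` are arbitrary:

```python
import numpy as np

def sparse_attention(query, keys, values, k=4):
    """Indexer-style sketch: score every past token, but attend only over
    the top-k highest-scoring ones instead of the full history."""
    scores = keys @ query                  # one score per past token
    top = np.argsort(scores)[-k:]          # the indexer keeps k candidates
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()               # softmax over the k survivors only
    return weights @ values[top]

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((100, 8))   # 100 past tokens, 8-dim keys
V = rng.standard_normal((100, 8))
out = sparse_attention(q, K, V, k=4)
```

The cost per query drops from attending over all 100 tokens to just 4, which is the whole point at the million-token contexts agentic workloads demand.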

Then there's the rollout problem. When you're training an AI agent that might spend hours on a single task—calling APIs, waiting for responses, making decisions—you can't freeze your entire training pipeline waiting for it to finish. GLM-5's solution is to decouple the rollout logic (generating responses, using tools) from the training logic (computing gradients, updating weights).

As one researcher explained: "On the one hand like in the generators you can basically define like kind of arbitrarily complex rollout logic and then you have some like API like HTTP thing... and so the nice thing with that right is that you can develop your own rollout logic without having to have the training stuff coupled."

This seems obvious in retrospect—of course you'd want to develop and test these pieces independently—but it's not how most RL frameworks work. The fact that GLM-5 needed to build this suggests we're at a transition point where the old tooling doesn't quite fit the new problems.
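The decoupling pattern looks roughly like a producer-consumer split: generators push finished trajectories, the trainer consumes them, and neither side knows the other's internals. This is an illustrative in-process sketch; the real system communicates over HTTP between separate services:

```python
import queue
import threading

# Generators push finished trajectories; the trainer consumes them.
# Decoupling the two means rollout logic (tools, APIs, retries) can be
# developed and tested without touching the gradient-update loop.
trajectory_queue = queue.Queue()

def generator(task_id):
    # Arbitrarily complex rollout logic lives here: call tools, wait on
    # APIs, make decisions. Only the finished trajectory crosses the boundary.
    trajectory = {"task": task_id, "tokens": [1, 2, 3], "reward": 1.0}
    trajectory_queue.put(trajectory)

def trainer(num_batches):
    batch = []
    for _ in range(num_batches):
        batch.append(trajectory_queue.get())  # blocks until a rollout arrives
    return batch

threads = [threading.Thread(target=generator, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
batch = trainer(num_batches=3)
for t in threads:
    t.join()
```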

When Theory Meets Practice

The conversation kept hitting moments where the paper's claims didn't quite match the researchers' understanding of the underlying systems. GLM-5 claims to do "entirely on-policy" training for the reasoning stage, but also uses asynchronous RL, "which I think it's inherently off policy," one researcher notes.

Another points out: "I've never really seen a paper using these huge models doing like on policy training but I guess they don't mean like synchronous on policy cuz that would be like super inefficient."

This isn't just academic nitpicking. It matters because on-policy versus off-policy training represents different trade-offs between sample efficiency and computational cost. At 700 billion parameters, those trade-offs become existential.

The team also discusses GLM-5's solution to very long rollouts: they just throw them away if they become too off-policy. "They seem to basically keep like a kind of history of like trajectories and then if you imagine that there's like one like roll out that is like taking let's say 3 hours and by then you've done so many updates to the model that it's like super off policy they uh they discard it."
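The discard rule amounts to a staleness filter over the trajectory history: if too many policy updates have happened since a rollout started, drop it. A minimal sketch, where `max_lag` is an assumed threshold (the paper's actual cutoff isn't stated in the discussion):

```python
def filter_stale(trajectories, current_version, max_lag=8):
    """Discard rollouts produced by a policy too many updates old. By the
    time a 3-hour rollout finishes, the model may have been updated so many
    times that training on it would be badly off-policy."""
    return [
        t for t in trajectories
        if current_version - t["policy_version"] <= max_lag
    ]

history = [
    {"id": "a", "policy_version": 100},  # fresh
    {"id": "b", "policy_version": 95},   # a little stale, still usable
    {"id": "c", "policy_version": 80},   # started long ago; discarded
]
fresh = filter_stale(history, current_version=100)
```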

This is the kind of decision you make when theory meets practice at scale. Is it elegant? No. Does it work? Apparently.

Three Ways to Think

One detail that got overlooked in the infrastructure discussion: GLM-5 can operate in three different reasoning modes. It can preserve all its reasoning traces across conversation turns, discard them to save context, or use "preserved thinking" where it keeps reasoning before each tool call.

This matters for a practical reason the researchers identified: "With this kind of thinking mode you need obviously like a way bigger token budget but they train their agents specifically for long context reasoning."

You're not just training one model anymore. You're training a model that can choose how much thinking to show, how much context to maintain, and when to prioritize efficiency over explainability. The model adapts to the task—coding scenarios need preserved thinking, simple queries don't.
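The three modes are essentially three policies for what survives into the next context window. A hypothetical sketch of how a serving layer might assemble context under each mode (the mode names are paraphrases of the article's descriptions, not the model's actual API):

```python
from enum import Enum

class ThinkingMode(Enum):
    PRESERVE_ALL = "preserve_all"        # keep every reasoning trace across turns
    DISCARD = "discard"                  # drop traces to save context
    PRESERVED_THINKING = "preserved"     # keep reasoning before each tool call

def build_context(turns, mode):
    """Assemble the context window according to the chosen thinking mode."""
    context = []
    for turn in turns:
        context.append(turn["answer"])
        if mode is ThinkingMode.PRESERVE_ALL:
            context.append(turn["thinking"])
        elif mode is ThinkingMode.PRESERVED_THINKING and turn.get("tool_call"):
            context.append(turn["thinking"])
    return context

turns = [
    {"thinking": "t1", "answer": "a1", "tool_call": True},
    {"thinking": "t2", "answer": "a2", "tool_call": False},
]
full = build_context(turns, ThinkingMode.PRESERVE_ALL)
lean = build_context(turns, ThinkingMode.DISCARD)
```

The token-budget trade-off falls directly out of this: preserving everything roughly doubles the context spent per turn, which is why the mode has to match the task.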

The complexity here isn't in any single technique. It's in the orchestration of dozens of techniques, each addressing a specific failure mode that only emerges when you scale up.

What Actually Shipped

GLM-5 ranks alongside the best closed-source models on standard benchmarks. The researchers confirmed it uses token-in-token-out interfaces rather than text-in-text-out, which prevents tokenization boundary issues during training. They verified the use of FP8 precision and prefill-decode disaggregation for managing long rollouts.
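Why token-in-token-out matters is easy to demonstrate: detokenizing and retokenizing does not always round-trip. A toy tokenizer with a made-up three-piece vocabulary (greedy longest-match, as BPE-style tokenizers effectively behave) shows the failure:

```python
# Hypothetical vocabulary where "foo" + "bar" retokenizes as one token.
VOCAB = {"foo": 0, "bar": 1, "foobar": 2}

def detokenize(token_ids):
    inv = {v: k for k, v in VOCAB.items()}
    return "".join(inv[t] for t in token_ids)

def tokenize(text):
    # Greedy longest-match over the vocabulary.
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
    return ids

generated = [0, 1]                         # model emitted "foo", then "bar"
round_trip = tokenize(detokenize(generated))
# The text path silently merged two tokens into one, so the training signal
# would no longer match what the model actually sampled.
```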

These aren't the sexy ML breakthroughs that make headlines. They're the engineering decisions that determine whether your training run completes or burns millions of dollars producing garbage.

The self-distillation approach—using previous checkpoints as teachers—might be the most interesting contribution. Not because it's revolutionary, but because it's practical at a scale where most "obvious" solutions break down.

We're past the point where training better AI models is primarily a research problem. It's increasingly an infrastructure problem, a systems engineering problem, a "how do we not forget what we learned" problem. GLM-5's solutions might not be elegant, but they're designed for constraints that most researchers never have to think about: what happens when your training run costs millions and you can't afford to fail twice?

Rachel 'Rach' Kovacs

Watch the Original Video

Hugging Face Journal Club: GLM-5: from Vibe Coding to Agentic Engineering


Hugging Face

33m 7s

About This Source

Hugging Face


HuggingFace is a dynamic and rapidly growing YouTube channel dedicated to the artificial intelligence (AI) community. Since launching in September 2025, it has amassed 109,000 subscribers, establishing itself as a hub for AI enthusiasts and professionals. The channel emphasizes open science and open-source collaboration, providing a platform to explore AI models, datasets, research papers, and applications.

