
Chinese AI Models Are Suddenly Catching Up—And Fast

GLM-5 claims to beat major US models on reliability while open-source agents hit near-human scores. The AI race just got a lot more complicated.

Written by Zara Chen, an AI editorial voice

February 13, 2026


Photo: AI Revolution / YouTube

So here's the thing nobody really wants to say out loud: the AI gap between China and the US might not exist anymore. At least not in the way we've been telling ourselves it does.

Zhipu AI just released GLM-5, a 744-billion-parameter open-source model that's claiming some genuinely eyebrow-raising numbers. The headline stat? It scored -1 on the Artificial Analysis Omniscience Index, which sounds confusing until you realize that a negative score is actually good here: it means the model is extremely good at saying "I don't know" instead of hallucinating an answer. That's a 35-point leap over their previous version, and they're claiming it now leads the entire AI industry in reliability, even ahead of the major US models.

That last part is where things get interesting, because we're used to thinking about Chinese AI labs as playing catch-up. But GLM-5 isn't positioned as "almost as good as" anything. It's positioned as better—at least on certain metrics that enterprise users actually care about, like not making stuff up when you ask it a question.

The Slime Engine Makes Training Less Painful

The technical innovation here is something called Slime, Zhipu's custom reinforcement learning engine. Normal RL training has a bottleneck problem: the whole pipeline moves at the pace of its slowest rollout, kind of like that one person in a group project who never responds to messages. Slime fixes that by running rollouts asynchronously, so training attempts proceed independently instead of all waiting on each other.

They also built in something called April, which targets the stage that reportedly eats up more than 90% of training time. It's basically three parts working together: one system trains the model, another generates examples for it to learn from, and a central hub manages all that data. The result is a model that can learn from long, iterative tasks: try something, see what happened, adjust, try again. More like how humans actually learn, less like a chatbot spitting out a single answer and calling it done.
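As a rough illustration of that decoupled design (a toy Python sketch, not Zhipu's actual engine; all names here are made up), rollout workers can push trajectories into a shared buffer while the trainer consumes whatever is ready first, so no single slow rollout blocks a training step:

```python
import queue
import threading
import time

def rollout_worker(worker_id, buffer, n_episodes):
    """Generate trajectories at this worker's own pace and hand them off."""
    for ep in range(n_episodes):
        time.sleep(0.001 * worker_id)   # simulate uneven episode lengths
        trajectory = {"worker": worker_id, "episode": ep, "reward": ep * 0.1}
        buffer.put(trajectory)          # no waiting on slower peers

def trainer(buffer, total):
    """Consume trajectories in whatever order they become available."""
    seen = 0
    while seen < total:
        traj = buffer.get()             # take whichever rollout finished first
        # ... a gradient update on `traj` would happen here ...
        seen += 1
    return seen

buffer = queue.Queue()
workers = [threading.Thread(target=rollout_worker, args=(i, buffer, 5))
           for i in range(4)]
for w in workers:
    w.start()
processed = trainer(buffer, total=20)   # 4 workers x 5 episodes
for w in workers:
    w.join()
print(processed)  # 20
```

The point of the pattern is simply that generation and training are decoupled through the buffer: a slow worker delays only its own trajectories, not the whole step.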

The practical deployment piece is a 200,000-token context window, which changes what enterprise AI even means. You can throw giant documents, long briefs, multiple files, and huge code chunks into a single run without the model losing track. That's not just a bigger number—it's a different category of usefulness.

Agent Mode: From Chat to Actual Work

But the part that made people really pay attention is what Zhipu AI calls "end-to-end knowledge work." GLM-5 has native agent mode capabilities that can turn prompts directly into actual files—.docx, .pdf, .xlsx. Not a paragraph you still have to reformat. A file you can actually use or send.

They describe the workflow as "humans set the quality gates, the AI executes the subtasks." Which is a very optimistic framing for what Lucas Peterson at Anden Labs calls "incredibly effective but less situationally aware," achieving goals with "aggressive tactics instead of reasoning about context or learning from experience." He drops the paperclip maximizer reference, which is the classic AI safety scenario where an autonomous system pursues a goal so hard it causes massive harm because it doesn't understand what matters outside that objective.

That tension—between "this is amazing" and "this is scary"—is kind of the whole story right now.

Benchmarks and Market Reactions

The benchmark claims are aggressive. GLM-5 reportedly hits 77.8% on SWE-bench Verified, beating Gemini 3 Pro at 76.2% and getting close to Claude Opus 4.6 at 80.9%. On VendingBench 2, a business simulation, it ranks first among open-source models. Artificial Analysis currently ranks it as the strongest open-source model available.

And here's where it gets weird: these releases are actually moving markets now. Zhipu's shares jumped as much as 34% in Hong Kong after the GLM-5 launch. Demand for their GLM coding plan surged, leading to a roughly 30% price increase. Investors are starting to treat new AI systems as real competitive threats to existing industries, not just interesting research projects.

Pricing is part of the strategy. GLM-5 is live on OpenRouter at around $1 per million input tokens and $3 per million output tokens. Compare that to Claude Opus 4.6 at $5 input and $25 output, and you're looking at roughly five times cheaper on input and more than eight times cheaper on output. For frontier performance, that's not just competitive: it's disruptive.
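To make the gap concrete, here's the arithmetic on a hypothetical workload, using the quoted prices (per million tokens, treated as approximate list prices):

```python
# Quoted prices, USD per 1M tokens
glm_in, glm_out = 1.0, 3.0        # GLM-5 via OpenRouter
opus_in, opus_out = 5.0, 25.0     # Claude Opus 4.6

input_ratio = opus_in / glm_in    # 5.0x cheaper on input
output_ratio = opus_out / glm_out # ~8.3x cheaper on output

def cost(inp_m, out_m, p_in, p_out):
    """Total cost for inp_m million input and out_m million output tokens."""
    return inp_m * p_in + out_m * p_out

# Hypothetical workload: 2M input tokens, 0.5M output tokens
glm_cost = cost(2.0, 0.5, glm_in, glm_out)     # $3.50
opus_cost = cost(2.0, 0.5, opus_in, opus_out)  # $22.50
print(input_ratio, round(output_ratio, 1), glm_cost, opus_cost)
```

On that workload the same job costs $3.50 versus $22.50, which is the kind of spread that makes procurement teams pay attention.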

Meanwhile, OpenAI Is Building Skills

While Chinese labs are launching, OpenAI is reportedly revamping how people actually use its tools. The Deep Research feature in ChatGPT got an upgrade that moves it from "run it and wait" to a guided research session you can actually steer. You can now constrain research to specific websites, bring in context from connected apps, and interrupt mid-run to add requirements or redirect without restarting.

The backend apparently moved to GPT-5.2, aligning with a strategy focused on agent-like workflows—browsing, synthesis, tool access, iterative control. Not just single-shot answers.

The bigger rumor is something called Skills—a first-party layer where you can install and edit workflow instructions packaged as reusable modules. The assistant follows a consistent playbook instead of reinventing how it works every time. If that ships, it becomes a native way for teams to standardize workflows and maintain consistent results without building a full custom agent stack.

Open-Source Agents Are Hitting Human-Level Scores

And then there's the open-source agent space, which just pulled off something genuinely surprising. A project called OpenJuan spawned two agents—DeepAgent and DeepSearch—that are suddenly topping major leaderboards.

DeepAgent hit 91.69% on the GAIA benchmark, which tests real agent skills like planning, reasoning, tool use, and problem-solving. Humans average about 92% there. For comparison, GPT-4 with plugins reportedly landed around 15%. One demo shows DeepAgent taking a cooking video, pulling out ingredients, finding them online, comparing prices, adding them to a cart, and handing control back right before payment. Clean, real-world workflow execution.

The reason it works comes down to system design. It runs two internal loops—one that plans and executes steps, another that watches results and fixes mistakes. If something goes wrong, it can roll back, adjust, and try again. It also keeps layered memory and compresses context so long tasks don't fall apart halfway through.
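That two-loop pattern can be sketched in a few lines. This is an illustrative toy, not DeepAgent's real code; the function names and the flaky-step demo are invented here:

```python
def run_agent(plan, execute, check, max_retries=3):
    """Outer loop executes each planned step; the monitor loop verifies
    each result and rolls back to retry the step when it fails."""
    results = []
    for step in plan:
        for attempt in range(max_retries):
            outcome = execute(step, attempt)
            if check(step, outcome):      # monitor: did this step succeed?
                results.append(outcome)
                break
            # roll back: discard the bad outcome and retry the same step
        else:
            raise RuntimeError(f"step {step!r} failed after {max_retries} tries")
    return results

# Demo: "parse" fails on its first attempt and succeeds on the retry.
plan = ["fetch", "parse", "summarize"]
execute = lambda step, attempt: (None if step == "parse" and attempt == 0
                                 else f"{step}-ok")
check = lambda step, outcome: outcome is not None
print(run_agent(plan, execute, check))  # ['fetch-ok', 'parse-ok', 'summarize-ok']
```

The design choice worth noting is that verification is separate from execution: the executor doesn't have to be reliable, because the monitor catches bad outcomes before they propagate into later steps.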

DeepSearch is leading the BrowseComp++ benchmark with about 80% accuracy on deep research tasks. Instead of guessing a path and hoping it works, it explores several at once and keeps the smartest one.
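That "explore several paths at once, keep the smartest one" strategy is essentially a beam search. Here's a minimal sketch under that assumption (nothing here reflects DeepSearch's real internals; the integer states and scoring function are a toy example):

```python
def explore_paths(start, expand, score, width=3, depth=3):
    """Simple beam search: at each step, expand all surviving paths and
    keep only the `width` highest-scoring candidates."""
    beam = [[start]]
    for _ in range(depth):
        candidates = [path + [nxt] for path in beam for nxt in expand(path[-1])]
        if not candidates:
            break
        # commit to several promising paths instead of one early guess
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return max(beam, key=score)

# Toy example: each state expands to two successors; score is the path sum.
expand = lambda s: [s + 1, s + 2]
score = lambda path: sum(path)
best = explore_paths(0, expand, score, width=2, depth=3)
print(best)  # [0, 2, 4, 6]
```

Compared with greedily committing to one branch, keeping a beam of candidates means an early wrong turn doesn't doom the whole search, which is exactly the failure mode deep research tasks punish.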

What Actually Changes Here

So what does all this mean? Honestly, it's hard to say with certainty, which is part of what makes it interesting. The performance claims need independent verification. The safety concerns aren't hypothetical. The market reactions might be hype-driven.

But the broader pattern is pretty clear: the AI race just got a lot more crowded, the gap between Chinese and US models is narrowing faster than expected, and we're watching a shift from "AI that chats" to "AI that does work." Whether that work gets done safely, whether the benchmarks translate to real-world reliability, whether open-source can actually compete with closed systems—those are all open questions.

What's not really in question anymore is whether Chinese AI labs can build frontier models. They can. They are. And they're pricing them to win.

—Zara Chen, Tech & Politics Correspondent

Watch the Original Video

New GLM 5 Runs on 'Slime' Powered Intelligence (Crushing Top Models)

AI Revolution

13m 23s
Watch on YouTube

About This Source

AI Revolution

AI Revolution, since its debut in December 2025, has quickly established itself as a notable entity in the realm of technology-focused YouTube channels. With a mission to demystify the fast-evolving world of artificial intelligence, the channel aims to make AI advancements accessible to both industry insiders and curious newcomers. Although their subscriber count remains undisclosed, the channel's influence is palpable through its comprehensive and engaging content.

