Alibaba's Qwen 3.5: Testing the Open-Source Model
Alibaba's Qwen 3.5 promises to rival Opus 4.5 and Gemini 3 Pro. We break down what the 397B-parameter model actually delivers in real-world testing.
Written by Yuki Okonkwo
February 18, 2026

Photo: WorldofAI / YouTube
Alibaba just dropped Qwen 3.5, and the spec sheet reads like someone's wish list: 397 billion parameters (with 17 billion active), native multimodal capabilities, Apache 2.0 licensing, and benchmark claims that it outpaces Claude Opus 4.5 and Gemini 3 Pro on specific tasks. It's the kind of release that makes you want to immediately spin up a test environment—or at least watch someone else do it.
The WorldofAI channel did exactly that, putting Qwen 3.5 through a gauntlet of coding challenges, visual reasoning tasks, and real-world generation tests. What emerged is less a triumphant coronation and more a nuanced picture of where open-source models actually stand in early 2026.
The Architecture: What's Actually New Here
Qwen 3.5 combines hybrid linear attention with sparse mixture of experts (MoE), scaled with what Alibaba calls "large reinforcement learning environments." In practical terms, this means the model activates only 17 billion of its 397 billion parameters for any given task—a design choice that makes it 19x faster than its predecessor, Qwen 3 Max, while supporting 201 languages and a 1 million token context window.
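To make that sparse-activation idea concrete, here's a toy PyTorch sketch of top-k mixture-of-experts routing. The dimensions, expert count, and gating scheme are illustrative placeholders, not Qwen 3.5's actual architecture; the point is simply that a router selects a small subset of experts per token, so most of the model's parameters sit idle on any given forward pass.

```python
# Toy top-k MoE layer (illustrative only, NOT Qwen 3.5's real design):
# a router scores all experts per token, but only the top-k experts run.
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)  # normalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            # Dispatch each token only to the experts its router selected.
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

With 32 experts and top-2 routing, each token touches roughly a sixteenth of the expert parameters—the same lever that lets Qwen 3.5 keep only 17 of its 397 billion parameters active per task.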
The benchmark numbers look competitive: 87.8% on MMLU-Pro, 87.5% on Video-MME, and an edge over Claude Opus 4.5 on BrowseComp. In coding benchmarks, it matches Opus on some tests and surpasses Gemini 3 Pro on SWE-Bench Verified, though it falls behind Gemini on Terminal-Bench.
These numbers matter, but they also don't tell you everything. Benchmarks measure what they're designed to measure—they don't necessarily predict how a model behaves when you ask it to build a functioning macOS interface or generate a photorealistic butterfly SVG.
The Reality Check: Demo Quality vs. Actual Output
The tester's experience with Qwen 3.5 reveals a pattern that's becoming familiar in AI model releases: the gap between promotional demos and reproducible results. Alibaba showcased a sleek car racing game supposedly generated by Qwen 3.5. When the tester attempted to replicate it, the initial output was... underwhelming. A basic prompt yielded a basic game. A more detailed prompt produced "Turbo Racer"—functional, but nowhere near the polish of the demo.
"I will say this is a decent generation but it does not mimic what we had saw from the Quen demos which might have been obviously prompted and edited multiple times to get the output," the tester noted.
This isn't necessarily deceptive—it's just the nature of probabilistic outputs and the art of prompting. But it does highlight a question that matters for anyone evaluating these models: are we measuring the model's ceiling or its typical performance?
Where Qwen 3.5 Actually Excels
The model showed genuine competence in several areas. When asked to build a macOS browser interface, it nailed the visual design on the first try (even if functionality required iteration). It accurately counted 28 toy cars in an image when using thinking mode. It generated a functional 3D room designer tool with working color changes and spatial recognition for furniture placement.
The farming simulation game generation was particularly impressive—"production ready demo logic" with harvesting, planting, animal interactions, inventory systems, and crop timers. For $0.20 of compute, that's not bad at all.
The SVG butterfly test revealed both capability and limitation: the model created an animated, photorealistic butterfly, but with overlapping wings that betrayed spatial reasoning gaps. It's the kind of result that's 85% there—good enough for rapid prototyping, not quite ready for production without human refinement.
The Context Size Problem
One insight from testing: Qwen 3.5's performance scales dramatically with context size. Using the full 1 million token context yields significantly better coding results than the smaller-tier versions. This creates a practical consideration for developers—you might need to factor in higher API costs to get the model's best performance.
"If you use the 1 million context with this model, it does a great job with coding. But when you're using the smaller tiered version, it's not going to get you the best output always," the tester explained.
This tracks with what we're seeing across modern LLMs: context isn't just about fitting more information—it's increasingly tied to reasoning capability.
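To put the cost consideration in rough numbers, here's a back-of-envelope estimator. The per-million-token rates below are hypothetical placeholders, not Alibaba's published pricing; substitute whatever your provider actually charges for Qwen 3.5.

```python
# Rough API cost in USD for a single call; rates are HYPOTHETICAL
# placeholders -- plug in your provider's published Qwen 3.5 pricing.
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate_per_m: float = 0.60,    # assumed $/1M input tokens
                  out_rate_per_m: float = 2.40):  # assumed $/1M output tokens
    return input_tokens / 1e6 * in_rate_per_m + output_tokens / 1e6 * out_rate_per_m

print(f"${estimate_cost(900_000, 4_000):.2f}")  # near-full 1M context: ~$0.55
print(f"${estimate_cost(8_000, 4_000):.2f}")    # typical short prompt: ~$0.01
```

Even at modest rates, filling most of the 1 million token window costs an order of magnitude more per call than a short prompt—which is exactly the trade-off the tester flagged.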
The Training Question Nobody's Asking Loudly
Midway through testing, the tester made an observation about Qwen 3.5's output style: "In my opinion, what it does look like is that these guys trained off of the Gemini output, which is something that a lot of these Chinese companies have been doing... I have personally talked to the people and they have stated clearly to me that they train their output off of these different proprietary models."
This raises interesting questions about the open-source model ecosystem. If Qwen 3.5's training includes outputs from proprietary models, what does "open-source" really mean here? The weights are available under Apache 2.0, which is genuinely useful for developers. But if the model's capabilities are partly derived from closed systems, the independence of open-source AI becomes murkier.
It's not necessarily wrong—distillation is a valid technique. But it does complicate the narrative about open alternatives competing with closed ones on their own merits.
Accessing Qwen 3.5: The Practical Options
One genuine advantage: Qwen 3.5 is accessible through multiple channels. You can download the weights directly, use the official chatbot, access the API through Alibaba Cloud, or route through services like Kilo Code (which offers $25 in free credits) and OpenRouter. This flexibility matters for developers who want to test without committing to a single platform.
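Of those options, OpenRouter is the fastest to wire up because it exposes an OpenAI-compatible endpoint. Here's a minimal sketch; the model slug is an assumption, so check OpenRouter's model list for the exact Qwen 3.5 identifier before running it.

```python
# Minimal OpenRouter call via the OpenAI SDK. The model slug below is
# an assumption -- look up the real Qwen 3.5 identifier on openrouter.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.5",  # hypothetical slug; verify before use
    messages=[{"role": "user",
               "content": "Generate a minimal farming-sim crop timer in JavaScript."}],
)
print(response.choices[0].message.content)
```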
Pricing appears competitive—the tester generated complex front-end interfaces for around $0.20, which positions it well against proprietary alternatives for certain use cases.
What This Actually Means for Developers
The tester's conclusion feels right: Qwen 3.5 is "a great decent model that I would just use as a backup." It's not displacing Claude or GPT-4 for mission-critical applications; it has genuine weaknesses in complex spatial reasoning and output consistency, and its real-world stability doesn't match the top closed models.
But as an open-source option? It's compelling. For rapid prototyping, internal tools, or situations where you need local deployment, it represents the current high-water mark for openly available models. The Apache 2.0 license removes legal friction. The multimodal capabilities work well enough for many practical applications.
The question isn't whether Qwen 3.5 beats Opus 4.5—it doesn't, not consistently. The question is whether it's good enough for your specific use case while offering the flexibility and control that open weights provide. For an increasing number of applications, the answer might actually be yes.
Yuki Okonkwo is Buzzrag's AI & Machine Learning Correspondent
Watch the Original Video
Qwen 3.5 The GREATEST Opensource AI Model That Beats Opus 4.5 and Gemini 3? (Fully Tested)
WorldofAI
14m 38s

About This Source
WorldofAI
WorldofAI is an engaging YouTube channel that has swiftly captured the attention of AI enthusiasts, boasting 182,000 subscribers since its inception in October 2025. The channel is dedicated to showcasing the creative and practical applications of Artificial Intelligence in everyday tasks, offering viewers a rich collection of tips, tricks, and guides to enhance their daily and professional lives through AI.