Alibaba's Qwen 3.5: Testing the Open-Source Model
Alibaba's Qwen 3.5 promises to rival Opus 4.5 and Gemini 3 Pro. We break down what the 397B-parameter model actually delivers in real-world testing.
Written by Yuki Okonkwo
February 18, 2026

Photo: WorldofAI / YouTube
Alibaba just dropped Qwen 3.5, and the spec sheet reads like someone's wish list: 397 billion parameters (with 17 billion active), native multimodal capabilities, Apache 2.0 licensing, and benchmark claims that it outpaces Claude Opus 4.5 and Gemini 3 Pro on specific tasks. It's the kind of release that makes you want to immediately spin up a test environment—or at least watch someone else do it.
The WorldofAI channel did exactly that, putting Qwen 3.5 through a gauntlet of coding challenges, visual reasoning tasks, and real-world generation tests. What emerged is less a triumphant coronation and more a nuanced picture of where open-source models actually stand in early 2026.
The Architecture: What's Actually New Here
Qwen 3.5 combines hybrid linear attention with sparse mixture of experts (MoE), scaled with what Alibaba calls "large reinforcement learning environments." In practical terms, this means the model activates only 17 billion of its 397 billion parameters for any given task—a design choice that makes it 19x faster than its predecessor, Qwen 3 Max, while supporting 201 languages and a 1 million token context window.
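To make that sparse-activation idea concrete, here's a toy PyTorch sketch of top-k mixture-of-experts routing. The dimensions, expert count, and gating scheme are illustrative placeholders, not Qwen 3.5's actual architecture; the point is simply that a router selects a small subset of experts per token, so most of the model's parameters sit idle on any given forward pass.

```python
# Toy top-k MoE layer (illustrative only, NOT Qwen 3.5's real design):
# a router scores all experts per token, but only the top-k experts run.
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)  # normalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            # Dispatch each token only to the experts its router selected.
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

With 32 experts and top-2 routing, each token touches roughly a sixteenth of the expert parameters—the same lever that lets Qwen 3.5 keep only 17 of its 397 billion parameters active per task.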
The benchmark numbers look competitive: 87.8% on MMLU-Pro, 87.5% on Video-MME, and an edge over Claude Opus 4.5 on BrowseComp. In coding benchmarks, it matches Opus on some tests and surpasses Gemini 3 Pro on SWE-Bench Verified, though it falls behind Gemini on Terminal-Bench.
These numbers matter, but they also don't tell you everything. Benchmarks measure what they're designed to measure—they don't necessarily predict how a model behaves when you ask it to build a functioning macOS interface or generate a photorealistic butterfly SVG.
The Reality Check: Demo Quality vs. Actual Output
The tester's experience with Qwen 3.5 reveals a pattern that's becoming familiar in AI model releases: the gap between promotional demos and reproducible results. Alibaba showcased a sleek car racing game supposedly generated by Qwen 3.5. When the tester attempted to replicate it, the initial output was... underwhelming. A basic prompt yielded a basic game. A more detailed prompt produced "Turbo Racer"—functional, but nowhere near the polish of the demo.
"I will say this is a decent generation but it does not mimic what we had saw from the Quen demos which might have been obviously prompted and edited multiple times to get the output," the tester noted.
This isn't necessarily deceptive—it's just the nature of probabilistic outputs and the art of prompting. But it does highlight a question that matters for anyone evaluating these models: are we measuring the model's ceiling or its typical performance?
Where Qwen 3.5 Actually Excels
The model showed genuine competence in several areas. When asked to build a macOS browser interface, it nailed the visual design on the first try (even if functionality required iteration). It accurately counted 28 toy cars in an image when using thinking mode. It generated a functional 3D room designer tool with working color changes and spatial recognition for furniture placement.
The farming simulation game generation was particularly impressive—"production ready demo logic" with harvesting, planting, animal interactions, inventory systems, and crop timers. For $0.20 of compute, that's not bad at all.
The SVG butterfly test revealed both capability and limitation: the model created an animated, photorealistic butterfly, but with overlapping wings that betrayed spatial reasoning gaps. It's the kind of result that's 85% there—good enough for rapid prototyping, not quite ready for production without human refinement.
The Context Size Problem
One insight from testing: Qwen 3.5's performance scales dramatically with context size. Using the full 1 million token context yields significantly better coding results than the smaller-tier versions. This creates a practical consideration for developers—you might need to factor in higher API costs to get the model's best performance.
"If you use the 1 million context with this model, it does a great job with coding. But when you're using the smaller tiered version, it's not going to get you the best output always," the tester explained.
This tracks with what we're seeing across modern LLMs: context isn't just about fitting more information—it's increasingly tied to reasoning capability.
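To put the cost consideration in rough numbers, here's a back-of-envelope estimator. The per-million-token rates below are hypothetical placeholders, not Alibaba's published pricing; substitute whatever your provider actually charges for Qwen 3.5.

```python
# Rough API cost in USD for a single call; rates are HYPOTHETICAL
# placeholders -- plug in your provider's published Qwen 3.5 pricing.
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate_per_m: float = 0.60,    # assumed $/1M input tokens
                  out_rate_per_m: float = 2.40):  # assumed $/1M output tokens
    return input_tokens / 1e6 * in_rate_per_m + output_tokens / 1e6 * out_rate_per_m

print(f"${estimate_cost(900_000, 4_000):.2f}")  # near-full 1M context: ~$0.55
print(f"${estimate_cost(8_000, 4_000):.2f}")    # typical short prompt: ~$0.01
```

Even at modest rates, filling most of the 1 million token window costs an order of magnitude more per call than a short prompt—which is exactly the trade-off the tester flagged.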
The Training Question Nobody's Asking Loudly
Midway through testing, the tester made an observation about Qwen 3.5's output style: "In my opinion, what it does look like is that these guys trained off of the Gemini output, which is something that a lot of these Chinese companies have been doing... I have personally talked to the people and they have stated clearly to me that they train their output off of these different proprietary models."
This raises interesting questions about the open-source model ecosystem. If Qwen 3.5's training includes outputs from proprietary models, what does "open-source" really mean here? The weights are available under Apache 2.0, which is genuinely useful for developers. But if the model's capabilities are partly derived from closed systems, the independence of open-source AI becomes murkier.
It's not necessarily wrong—distillation is a valid technique. But it does complicate the narrative about open alternatives competing with closed ones on their own merits.
Accessing Qwen 3.5: The Practical Options
One genuine advantage: Qwen 3.5 is accessible through multiple channels. You can download the weights directly, use the official chatbot, access the API through Alibaba Cloud, or route through services like Kilo Code (which offers $25 in free credits) and OpenRouter. This flexibility matters for developers who want to test without committing to a single platform.
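Of those options, OpenRouter is the fastest to wire up because it exposes an OpenAI-compatible endpoint. Here's a minimal sketch; the model slug is an assumption, so check OpenRouter's model list for the exact Qwen 3.5 identifier before running it.

```python
# Minimal OpenRouter call via the OpenAI SDK. The model slug below is
# an assumption -- look up the real Qwen 3.5 identifier on openrouter.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3.5",  # hypothetical slug; verify before use
    messages=[{"role": "user",
               "content": "Generate a minimal farming-sim crop timer in JavaScript."}],
)
print(response.choices[0].message.content)
```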
Pricing appears competitive—the tester generated complex front-end interfaces for around $0.20, which positions it well against proprietary alternatives for certain use cases.
What This Actually Means for Developers
The tester's conclusion feels right: Qwen 3.5 is "a great decent model that I would just use as a backup." It's not displacing Claude or GPT-4 for mission-critical applications; it has genuine weaknesses in complex spatial reasoning and output consistency, and its real-world stability doesn't match the top closed models.
But as an open-source option? It's compelling. For rapid prototyping, internal tools, or situations where you need local deployment, it represents the current high-water mark for openly available models. The Apache 2.0 license removes legal friction. The multimodal capabilities work well enough for many practical applications.
The question isn't whether Qwen 3.5 beats Opus 4.5—it doesn't, not consistently. The question is whether it's good enough for your specific use case while offering the flexibility and control that open weights provide. For an increasing number of applications, the answer might actually be yes.
Yuki Okonkwo is Buzzrag's AI & Machine Learning Correspondent
Watch the Original Video
Qwen 3.5 The GREATEST Opensource AI Model That Beats Opus 4.5 and Gemini 3? (Fully Tested)
WorldofAI
14m 38s

About This Source
WorldofAI
WorldofAI is an engaging YouTube channel that has swiftly captured the attention of AI enthusiasts, boasting 182,000 subscribers since its inception in October 2025. The channel is dedicated to showcasing the creative and practical applications of Artificial Intelligence in everyday tasks, offering viewers a rich collection of tips, tricks, and guides to enhance their daily and professional lives through AI.