AI Models Now Run in Your Browser. That Shouldn't Work.
Transformers.js v4 brings 20-billion parameter AI models to web browsers. The technical achievement is remarkable. The implications are just beginning.
Written by AI. Bob Reynolds
March 30, 2026

Photo: Hugging Face / YouTube
The browser was never meant for this. It was designed to display documents, maybe run some animation scripts. Now Hugging Face's Transformers.js version 4 is running AI models with 20 billion parameters inside Chrome tabs.
The technical details matter here because they explain why this particular release represents more than incremental improvement. Developers Nico and Joshua rewrote the entire WebGPU backend in C++, abandoning the JavaScript-only approach that locked earlier versions into browser-only deployment. The new architecture runs across Python, C++, C#, and JavaScript—meaning the same code works in browsers, Node, Bun, and Deno.
"Our previous WebGPU backend was JavaScript only, accessing the browser's WebGPU API directly," the developers explain in their release video. "That worked but locked us into the browser. The new C++ backend changes everything."
What changed is platform independence. WebGPU acceleration no longer requires a browser. That's architecturally significant because it removes the constraint that made browser-based AI feel like a novelty rather than infrastructure.
The Performance Question
The real test isn't whether models run—it's whether they run fast enough to be useful. Version 4 demonstrates GPT-OSS, a 20-billion parameter model, generating 40 tokens per second in-browser. For context, that's readable speed. Not blazing, but functional.
The trick is a mixture-of-experts architecture. Instead of activating all 20 billion parameters for each token, the model routes each request to specialized sub-networks. Only the relevant experts fire. This keeps inference computationally feasible on hardware that was never designed for this workload.
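The routing idea can be sketched in a few lines. This is an illustrative toy, not the actual GPT-OSS implementation: `topKExperts`, `moeForward`, and the scalar "experts" below are invented for the example. A learned gate scores every expert for the current token, and only the top-scoring few actually compute.

```javascript
// Toy mixture-of-experts routing sketch (invented for illustration,
// not the GPT-OSS internals): score all experts, run only the top k.
function topKExperts(gateScores, k) {
  return gateScores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.index);
}

function moeForward(token, experts, gate, k = 2) {
  const scores = gate(token);            // one score per expert
  const active = topKExperts(scores, k); // indices of experts to run
  let output = 0;
  for (const i of active) {
    // Only the selected experts compute; the rest stay idle.
    output += scores[i] * experts[i](token);
  }
  return output;
}

// Demo: four "experts" that are simple scalar functions.
const experts = [x => x + 1, x => x * 2, x => x - 3, x => x / 2];
const gate = x => [0.1, 0.6, 0.05, 0.25]; // pretend learned gate
console.log(moeForward(10, experts, gate)); // only 2 of 4 experts fire
```

In a real model the experts are feed-forward networks and the gate is learned, but the payoff is the same: per-token compute scales with the active experts, not the full parameter count.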
LiquidAI's LFM 2.5, a 1.2-billion parameter model, runs even faster. The developers reimplemented these architectures "operation by operation," leveraging fused kernels—optimizations that combine multiple operations into single GPU calls. It's painstaking work. You don't get press releases about kernel fusion, but it's what makes the difference between a demo and a tool.
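Kernel fusion itself can be shown by analogy in plain JavaScript. The sketch below is conceptual: on a GPU, each pass over the data is a separate kernel launch with its own memory round-trip, and fusing collapses two launches into one. The function names are invented for the example.

```javascript
// Unfused: two passes over the data, with an intermediate array
// in between (analogous to two separate GPU kernel launches).
function scaleThenBias(xs, scale, bias) {
  const scaled = xs.map(x => x * scale); // pass 1
  return scaled.map(x => x + bias);      // pass 2
}

// Fused: one pass, no intermediate allocation
// (analogous to a single fused GPU kernel).
function scaleBiasFused(xs, scale, bias) {
  return xs.map(x => x * scale + bias);
}
```

Both produce identical results; the fused version just touches memory once, which is where the speed comes from at GPU scale.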
What Actually Works
The release includes nearly 3,000 compatible models spanning 200 architectures. Some highlights reveal the breadth: TranslateGemma handles 55 languages. Voxtral delivers real-time speech recognition entirely locally. Chatterbox Turbo clones voices from five-second audio samples, complete with emotional tags like [chuckle] and [gasp].
Qwen 3.5, a multimodal model, reportedly performs at GPT-4o levels from "a couple of years ago." That qualifier matters. AI performance benchmarks shift fast enough that age-stamping comparisons is necessary. What impressed in 2023 is baseline in 2025.
The video demonstration shows these models running locally with no server calls. The privacy implications are straightforward: data never leaves the device. The practical implications are more complex. Local inference means offline capability but also means performance scales with user hardware. A MacBook Pro and a Chromebook deliver different experiences.
Developer Infrastructure
The ModelRegistry addition addresses a problem that sounds minor until you've hit it: not knowing what a model requires before you load it. The new system exposes file requirements, calculates total download size, checks cache status, and allows cache clearing. Steven Roussey, described as "a very active member of our community," added precision type checking.
These features matter for production deployment. Showing users accurate loading bars requires knowing total progress across all files. Supporting offline mode requires reliable caching. Authenticated model access requires custom fetch functions with proper headers. Version 4 provides hooks for all of this.
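The accurate-loading-bar problem is concrete enough to sketch. This is not the actual ModelRegistry API; the manifest shape and helper names below are assumptions invented for illustration. The point is that a single honest progress fraction requires every file's size up front.

```javascript
// Hypothetical manifest for a model's files (invented for this sketch;
// sizes are made up, not real Transformers.js numbers).
const hypotheticalManifest = [
  { file: 'model.onnx',     bytes: 1_200_000_000 },
  { file: 'tokenizer.json', bytes: 2_100_000 },
  { file: 'config.json',    bytes: 1_400 },
];

// Total download size: the number a UI shows before the user commits.
function totalDownloadSize(manifest) {
  return manifest.reduce((sum, f) => sum + f.bytes, 0);
}

// Overall progress across all files, given bytes received per file.
function overallProgress(manifest, loadedBytes) {
  const loaded = manifest.reduce(
    (sum, f) => sum + Math.min(loadedBytes[f.file] ?? 0, f.bytes), 0);
  return loaded / totalDownloadSize(manifest);
}
```

Without per-file sizes, a progress bar can only guess; with them, it can report one smooth fraction across a gigabyte-scale, multi-file download.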
The build system migration to esbuild dropped compile times from two seconds to 200 milliseconds. File sizes decreased slightly. The codebase split from a monolithic models file into separate per-model modules. These are housekeeping improvements that compound over time—faster iteration, easier debugging, simpler onboarding.
The Harder Questions
What's missing from the announcement is any discussion of limitations. Twenty billion parameters in a browser is impressive. It's also computationally expensive in ways that don't show up in token-per-second metrics. Battery drain. Thermal throttling. Memory pressure that degrades performance of other tabs.
The mixture-of-experts approach mitigates some of this by activating only relevant sub-networks per token. But "mitigates" is not "solves." Running these models on constrained devices—the phrase appears in the video—means running them on phones and tablets with limited cooling and battery capacity.
There's also the question of what happens when everyone does this. If every website starts running billion-parameter models locally, we've shifted the computational burden from centralized servers to distributed clients. That's not automatically better. It's different, with different cost structures and failure modes.
What This Enables
The privacy advantages are real. Medical applications, financial tools, personal assistants—anything handling sensitive data benefits from processing that never touches external servers. The offline capability matters in regions with unreliable connectivity or for users who need functionality without network dependence.
The developers position this as democratizing access to AI capabilities. That argument holds if you accept that JavaScript developers should have the same model access as Python developers. The 8.3-kilobyte tokenizer package, now standalone, supports that case—zero dependencies, works everywhere.
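What a standalone tokenizer does can be illustrated with a toy greedy longest-match scheme. This is not the actual package's API or algorithm (real tokenizers typically use byte-pair encoding); the `tokenize` function and vocabulary below are invented for the example.

```javascript
// Toy greedy longest-match tokenizer (illustrative only): split text
// into the longest pieces found in a fixed vocabulary, falling back
// to single characters for anything unknown.
function tokenize(text, vocab) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    // Try the longest candidate first, shrinking until a vocab hit.
    for (let len = Math.min(text.length - i, 8); len > 0; len--) {
      const piece = text.slice(i, i + len);
      if (vocab.has(piece)) { match = piece; break; }
    }
    tokens.push(match ?? text[i]); // unknown: emit one character
    i += (match ?? text[i]).length;
  }
  return tokens;
}

const vocab = new Set(['trans', 'form', 'ers', '.js']);
console.log(tokenize('transformers.js', vocab));
// → ['trans', 'form', 'ers', '.js']
```

The whole thing is a string-matching loop over a lookup table, which is why a production tokenizer can fit in kilobytes with zero dependencies while the models it feeds weigh gigabytes.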
What's genuinely new here is the combination of model sophistication and deployment simplicity. Voice cloning from five-second samples. Real-time video captioning. Multilingual translation across 55 languages. All of it running in the same environment that displays this article.
Whether that's progress or proliferation depends on what gets built next. The infrastructure now exists for browser-based AI that doesn't feel like a compromise. Five years ago, that would have seemed impossible. Two years ago, improbable. Now it's shipping code with release notes and GitHub repositories.
The question isn't whether AI runs in browsers anymore. It's what happens when it runs well enough that developers stop thinking about the constraints.
—Bob Reynolds, Senior Technology Correspondent
Watch the Original Video
Transformers.js v4: State-of-the-art machine learning for the web
Hugging Face
8m 4s

About This Source
Hugging Face
HuggingFace is a dynamic and rapidly growing YouTube channel dedicated to the artificial intelligence (AI) community. Since launching in September 2025, it has amassed 109,000 subscribers, establishing itself as a hub for AI enthusiasts and professionals. The channel emphasizes open science and open-source collaboration, providing a platform to explore AI models, datasets, research papers, and applications.