Intel's Budget GPU Play: 96GB of VRAM for $2,600
Four Intel ARC Pro B60 cards deliver 96GB of VRAM at a fraction of Nvidia's cost. But cheap memory doesn't guarantee useful performance.
Written by AI · Bob Reynolds
March 29, 2026

Photo: Alex Ziskind / YouTube
When Alex Ziskind stacked four Intel ARC Pro B60 graphics cards into a server chassis, he created what might be the cheapest path to 96GB of VRAM currently available. At roughly $2,600 for the full set—assuming street prices around $650 per card—it's a fraction of what you'd pay for Nvidia's RTX Pro 6000, which offers the same memory capacity in a single card for $8,500.
The mathematics are straightforward. Each B60 carries 24GB of GDDR6, 456 GB/s of memory bandwidth, and a 200W power envelope. Multiply by four, and you've got enough VRAM to run the kind of large language models that used to require enterprise hardware. The question Ziskind set out to answer is whether this represents genuine value or merely cheap specifications.
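To make that arithmetic concrete, here is a quick back-of-the-envelope comparison using the approximate street prices and specs cited above (prices are the article's estimates, not current quotes):

```python
# Back-of-the-envelope VRAM-per-dollar comparison,
# using the approximate prices and specs cited in the article.
b60 = {"vram_gb": 24, "price_usd": 650, "bandwidth_gbs": 456, "power_w": 200}
cards = 4

total_vram = cards * b60["vram_gb"]      # 96 GB
total_cost = cards * b60["price_usd"]    # $2,600
rtx_pro_6000 = {"vram_gb": 96, "price_usd": 8500}

print(f"4x B60: {total_vram} GB for ${total_cost}"
      f" -> ${total_cost / total_vram:.0f} per GB of VRAM")
print(f"RTX Pro 6000: {rtx_pro_6000['vram_gb']} GB for ${rtx_pro_6000['price_usd']}"
      f" -> ${rtx_pro_6000['price_usd'] / rtx_pro_6000['vram_gb']:.0f} per GB of VRAM")
```

That works out to roughly $27 per gigabyte of VRAM for the Intel stack versus roughly $89 per gigabyte for the single Nvidia card.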
The Memory Versus Speed Trade-off
To understand Intel's approach, you need to see what else $650 to $800 buys in the GPU market. AMD's RX 7900 XT costs roughly the same but offers only 20GB of VRAM—though it compensates with 800 GB/s of memory bandwidth and 315W of power. Nvidia's RTX Pro 2000 Blackwell, at $800, provides just 16GB of the newest GDDR7 memory with 288 GB/s bandwidth, but sips only 70W from the PCIe bus.
Ziskind tested all three against each other running the same Qwen 3 4B model in full BF16 precision—an 8GB model that leaves plenty of headroom for context. The Nvidia card, despite being the smallest and lowest-power option, delivered 5,223 tokens per second for prompt processing at just 70W of power draw. The Intel B60 managed 9,576 tokens per second at 120W. The AMD card, running hottest at 400W, produced the fastest token generation at 431 tokens per second under high-concurrency loads.
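Dividing the prompt-processing figures by the observed power draw gives a rough efficiency picture. This is a back-of-envelope calculation from the numbers quoted above, not a measurement from the video:

```python
# Rough prompt-processing efficiency, derived from the figures quoted above:
# tokens per second divided by observed power draw during the test.
results = {
    "Nvidia RTX Pro 2000 (Blackwell)": (5223, 70),   # tok/s, watts
    "Intel Arc Pro B60":               (9576, 120),
}
for gpu, (tok_per_s, watts) in results.items():
    print(f"{gpu}: {tok_per_s / watts:.0f} tokens/s per watt")
```

Both land around 75 to 80 tokens per second per watt for prompt processing, which suggests the B60's higher draw buys roughly proportional throughput rather than wasting power.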
"The B60 is not trying to be the fastest GPU," Ziskind observed. "It's just trying to be the GPU that gives you the most VRAM density for the money."
That distinction matters. When you're running large models, memory capacity determines what fits. Speed determines how fast it runs. Intel chose capacity.
Four Cards, Real Constraints
Ziskind's four-GPU setup consumes 372W at idle—not unreasonable for four 200W cards plus a Xeon processor and server-class components. Under load running a 65GB DeepSeek R1 Qwen 32B model across all four cards, the system peaked at 940W. Each GPU pulled 120-130W, well below their 200W ceiling, at roughly 22% utilization.
Those numbers reveal something important: the cards weren't power-limited, and they weren't compute-saturated either. They sat somewhere in between, handling concurrent requests at a peak throughput of 574 tokens per second with all 96GB of memory fully utilized. When Ziskind swept concurrency levels—simulating multiple users or agent processes hitting the system simultaneously—he found optimal performance at 64 concurrent requests. Below that, the hardware sits underutilized; above it, diminishing returns set in.
The system sustained 289 tokens per second for generation at that concurrency level. For comparison, a single Nvidia RTX 4090 with the same 24GB of VRAM costs over $2,000 and can't be easily stacked four-deep in a standard chassis.
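A concurrency sweep like the one described above can be reproduced against any OpenAI-compatible serving endpoint. The sketch below is illustrative only: the endpoint URL, model name, and prompt are placeholders, and it assumes the server exposes the standard /v1/completions API.

```python
import asyncio
import time

import httpx  # any async HTTP client works; httpx is used here for brevity

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder local server
MODEL = "deepseek-r1-distill-qwen-32b"             # placeholder model name

async def one_request(client: httpx.AsyncClient) -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = await client.post(ENDPOINT, json={
        "model": MODEL,
        "prompt": "Explain PCIe lane allocation in one paragraph.",
        "max_tokens": 128,
    }, timeout=300)
    return resp.json()["usage"]["completion_tokens"]

async def sweep(concurrency: int) -> float:
    """Fire `concurrency` requests at once and return aggregate tokens/s."""
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
        return sum(tokens) / (time.perf_counter() - start)

if __name__ == "__main__":
    for level in (1, 8, 16, 32, 64, 128):
        print(level, round(asyncio.run(sweep(level))), "tok/s aggregate")
```

Plotting aggregate throughput against concurrency is how you find the knee of the curve—the 64-request sweet spot in Ziskind's case.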
The Software Penalty
Here's where Intel's value proposition hits a wall: ecosystem maturity. To run these cards, Ziskind used Intel's LLM Scaler stack, which lags several versions behind current vLLM releases. "You don't get the latest and greatest models," he noted, "because you kind of have to use their stack."
As of mid-March 2025, the Intel stack supported GLM Flash but not Qwen 3.5, which had become a standard for coding tasks. The lag appears to run about a month behind mainstream releases. For developers who need to experiment with cutting-edge models, that's a meaningful constraint. For production workloads running established models, it's less critical.
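For a sense of what splitting one model across four cards looks like, here is a minimal sketch using stock vLLM's Python API. Intel's LLM Scaler wraps a pinned, older vLLM build inside its own containers, so the exact invocation, supported models, and backend selection will differ; the model name below is illustrative.

```python
# Minimal sketch of tensor-parallel serving with stock vLLM's Python API.
# Intel's LLM Scaler ships a pinned, older vLLM build, so versions and
# model support differ from upstream; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # roughly 65GB of weights in BF16
    tensor_parallel_size=4,   # shard the weights across the four B60s
    dtype="bfloat16",
)

outputs = llm.generate(
    ["Summarize the trade-off between VRAM capacity and memory bandwidth."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```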
Ziskind successfully ran the system as a coding-agent backend, connecting it to VS Code for real-time assistance. At 800W of system power and 27 tokens per second, it worked—not impressively fast, but usable for tasks where memory capacity matters more than raw speed.
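The write-up doesn't say which VS Code extension handled the integration, but any tool that speaks the OpenAI-compatible API can point at the local server. A minimal sketch with the openai Python client, where the endpoint, API key, and model name are placeholders:

```python
# Point any OpenAI-compatible client at the locally served model.
# Endpoint, API key, and model name are placeholders for a local
# vLLM-style server; the actual tooling used in the video isn't specified.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

reply = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",
    messages=[{
        "role": "user",
        "content": "Refactor this loop into a list comprehension:\n"
                   "result = []\nfor x in data:\n    if x > 0:\n        result.append(x * 2)",
    }],
)
print(reply.choices[0].message.content)
```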
What This Actually Means
Intel's play here is transparent: deliver maximum VRAM per dollar and let users decide if that trade-off works for their use case. For training runs that need to hold large models in memory, for inference servers handling many concurrent lightweight requests, or for development environments where you're testing different model sizes, 96GB at this price point opens possibilities that didn't exist before.
It doesn't replace Nvidia's ecosystem advantage—the software maturity, the optimization for AI workloads, the immediate support for new models. It doesn't match AMD's raw bandwidth for throughput-constrained tasks. What it does is make a specific capability affordable: the ability to load and run models that simply won't fit on anything else in this price range.
Ziskind's testing exposed the practical limits. The system runs hot enough that he moved it off his desk. It sounds like an airplane under load. Some models failed at higher concurrency levels. The software stack requires patience. But for $2,600, you can run 65GB models with room for substantial context—something that would cost three times as much with Nvidia's current offerings.
The real question isn't whether Intel's approach is better than Nvidia's or AMD's. It's whether maximizing VRAM per dollar serves your specific workload better than maximizing bandwidth or compute density. Intel is betting that for enough people, the answer is yes.
Bob Reynolds is Senior Technology Correspondent for Buzzrag.
Watch the Original Video
I Tested the Cheapest Path to 96GB of VRAM
Alex Ziskind
19m 48s
About This Source
Alex Ziskind
Alex Ziskind is a seasoned software developer turned content creator, captivating an audience of over 425,000 subscribers with his tech-savvy insights and humor-infused reviews. With more than 20 years in the coding realm, Alex's YouTube channel serves as a digital playground for developers eager to explore software enigmas and tech trends.