Intel Arc Pro B60: Testing 96GB of AI VRAM for $5K
Level1Techs tests Intel's Battle Matrix with four Arc Pro B60 GPUs—96GB VRAM for the price of an RTX 5090. Real-world AI performance examined.
Written by AI. Tyler Nakamura
February 27, 2026

Photo: Level1Techs / YouTube
Here's the value proposition that made Level1Techs curious enough to build a whole test rig: four Intel Arc Pro B60 GPUs with 24GB of VRAM each cost roughly the same as a single 32GB RTX 5090. That's 96GB of total VRAM versus 32GB. The math is simple. The execution? That's where things get interesting.
The Hardware Reality
The Arc Pro B60 cards aren't trying to beat Nvidia's flagship data center GPUs. They're playing a different game entirely. Each card pulls 150-200 watts through a standard 8-pin connector—you can run four of them in a regular desktop PC on a normal North American household outlet. No special electrical work required. No enterprise cooling solutions. Just a Xeon workstation with enough PCIe lanes and you're done.
"These four GPUs have 96 gigs of VRAM, as much as an Nvidia RTX Pro 6000, which costs $8,500," Level1Techs points out in their testing video. "But the four of these GPUs cost about the same as a 32 gig RTX 5090."
That's the pitch: VRAM per dollar for local AI inference. Not maximum performance. Not data center scale. Just enough power to run large language models locally without cloud subscription costs.
Each B60 packs 20 Xe cores, 160 XMX engines (Intel's matrix multiplication hardware), and 456 GB/s of memory bandwidth. The PCIe Gen 5 x8 interface matters for some workloads—data parallel work runs fine, but tensor parallel operations that need constant GPU-to-GPU communication may hit bandwidth limits before compute limits. Your workload determines whether that's a problem.
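For a sense of what that distinction looks like in practice, here's a minimal sketch using upstream vLLM's Python API; Intel's fork may expose this differently, and the model names are purely illustrative rather than anything from the video.

```python
# A rough sketch of the two scaling modes, using upstream vLLM's Python API.
# (Intel's fork may differ; model name is illustrative, not from the video.)
from vllm import LLM, SamplingParams

# Tensor parallel: one model sharded across all four B60s. Every decode step
# involves cross-GPU traffic, so the PCIe Gen 5 x8 links can become the
# ceiling before compute does.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)

out = llm.generate(["Explain PCIe bandwidth in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

# Data parallel, by contrast, means four independent replicas of a model that
# fits in a single card's 24GB: one serving process per GPU, with requests
# load-balanced across them. Almost no GPU-to-GPU traffic, so the x8 links
# stop mattering.
```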
The Software Situation
Intel's betting on a "validated stack"—Ubuntu 25.04, specific kernel versions, and their own fork of vLLM (the popular LLM serving framework). This is where things get complicated. Mainline vLLM was at version 0.15 when Level1Techs ran their tests. Intel's stack was at 0.11.1. That gap represents features, optimizations, and bug fixes that aren't available yet on Intel's platform.
"Intel is maintaining cadence, but there's still a gap versus mainline vLLM," the Level1Techs team notes. "Whether that gap matters depends on whether you need the latest features right now, or you want something stable that's going to ship with known good drivers and kernels."
This is classic vendor-branch-versus-mainline tension. Intel wants stability and validation. The community wants bleeding-edge features. You can't have both simultaneously.
The testing revealed which models work great and which ones... don't. MXFP4 quantization is where these cards shine—Level1Techs ran a 120 billion parameter model requiring 60GB of VRAM and got solid performance. Nearly 1,000 tokens per second on GPT-OSS 120B with 3.9 seconds to first token. That's legitimately good for a sub-$5,000 setup.
Smaller models in the 8-20 billion parameter range? "All day long. All day long, it's going to work pretty well on this platform," according to the testing.
But Llama 70B threw errors initially. The FP16 weights alone need about 140GB of VRAM—more than this setup provides—so dynamic quantization is required. Once Level1Techs figured out the right block size settings and chat template configurations, it worked, though some responses came out "a little bit incoherent." The best-case scenario yielded 366 tokens per second of output with 12.9 seconds to first token. Usable, not amazing.
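The memory arithmetic is easy to sanity-check yourself; the figures below count weights only, so the KV cache and activations push the real requirement higher still.

```python
# Weights-only VRAM estimate for a 70B-parameter model at different precisions.
params = 70e9

print(f"FP16 : {params * 2   / 1e9:.0f} GB")  # 2 bytes/param    -> ~140 GB, exceeds 96 GB
print(f"INT8 : {params * 1   / 1e9:.0f} GB")  # 1 byte/param     -> ~70 GB, fits
print(f"4-bit: {params * 0.5 / 1e9:.0f} GB")  # ~0.5 bytes/param -> ~35 GB, plenty of headroom
```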
Where Performance Actually Lands
Qwen3 30B: 991 tokens per second of throughput. The DeepSeek distill of Llama 70B: 178 tokens per second, with better coherence than standard Llama 70B. These numbers tell you the system handles large models competently when the software path is optimized.
The concerning part? Time to first token in agentic coding scenarios. When an editor loads 10,000 tokens of context, prefill performance becomes the bottleneck. "That is when the prefill performance will bite you a little bit," Level1Techs warns. For interactive use with short prompts, first token latency is 1-2 seconds. For context-heavy workflows, it's noticeably slower.
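If you want to see where your own prompts land, time-to-first-token is straightforward to measure against the OpenAI-compatible endpoint that vLLM exposes. A minimal sketch, with the URL and model name as placeholders for whatever the local server is actually running:

```python
# Measure time-to-first-token against a local OpenAI-compatible vLLM server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

long_context = "lorem ipsum " * 4000  # crude stand-in for ~10k tokens of editor context

start = time.perf_counter()
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": long_context + "\nSummarize this."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"time to first token: {time.perf_counter() - start:.2f}s")
        break
```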
During testing, the team encountered GPU kernel crashes—not unique to Intel, as they note similar issues with Blackwell GPUs, but still something you need monitoring scripts to catch. They put together a diagnostic tool to email admins when a GPU drops, which feels both pragmatic and slightly concerning that it's necessary.
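Level1Techs didn't publish their diagnostic tool, so this is only a sketch of the idea: poll how many GPUs the kernel still exposes and email an admin when one disappears. The device path, expected count, and mail setup are all assumptions.

```python
# Hypothetical GPU watchdog: alert an admin if a card drops off the bus.
import glob, smtplib, time
from email.message import EmailMessage

EXPECTED_GPUS = 4
ADMIN = "admin@example.com"  # placeholder address

def gpu_count() -> int:
    # Assumes each healthy GPU appears as a render node under /dev/dri.
    return len(glob.glob("/dev/dri/renderD*"))

def alert(found: int) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"GPU dropped: {found}/{EXPECTED_GPUS} visible"
    msg["From"], msg["To"] = ADMIN, ADMIN
    msg.set_content("Check dmesg for GPU resets and restart the serving stack.")
    with smtplib.SMTP("localhost") as s:  # assumes a local mail relay
        s.send_message(msg)

while True:
    if gpu_count() < EXPECTED_GPUS:
        alert(gpu_count())
    time.sleep(60)
```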
The Unfinished Software Story
ComfyUI works through Intel's LLM Scaler Omni container, but with a subset of expected models. Some chunked prefill settings don't actually disable when told to, causing crashes. These feel like container bugs more than fundamental architecture problems, but they're friction nonetheless.
"Intel's pitch will fail if the software part of it fails," Level1Techs states plainly. "And historically, Intel has been a software juggernaut, but 2026 is moving fast."
The oneAPI approach—Intel's attempt at making GPU programming portable across vendors—gets positive marks for DIY scriptability. Being able to automate and deploy with Python matters for people building actual applications rather than just running benchmarks.
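That scriptability claim is concrete: once the containerized stack is serving, the deployment glue is ordinary Python against a standard HTTP API. A sketch with a placeholder endpoint and model name:

```python
# Minimal deployment glue against a local OpenAI-compatible serving endpoint.
import requests

BASE = "http://localhost:8000/v1"  # placeholder endpoint

# Health-check before routing traffic to the box.
assert requests.get(f"{BASE}/models", timeout=5).ok

resp = requests.post(f"{BASE}/chat/completions", json={
    "model": "local-model",
    "messages": [{"role": "user", "content": "Classify this ticket: printer on fire"}],
    "max_tokens": 32,
}, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```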
What This Actually Means
Intel is "uncharacteristically aggressive" with pricing here. They're not competing on specs. They're competing on making local AI inference accessible to people who can't afford $8,500 enterprise GPUs or don't want perpetual cloud subscription costs.
The Arc Pro B60 platform works best when:
- Your models fit in 96GB and use INT8/MXFP4 quantization
- You're running inference, not fine-tuning or training
- You're running data parallel rather than tensor parallel workloads
- You value privacy and local execution over maximum performance
- You're okay with slightly-behind-mainline software
It struggles when:
- You need sub-second time-to-first-token with large context
- Your workflow requires the absolute latest vLLM features
- You're doing tensor parallel operations with heavy cross-GPU communication
- You want a plug-and-play experience without troubleshooting
Level1Techs is continuing testing and asking their community what workloads to try next. That's the right move—this hardware's place in the market depends entirely on what people actually need to run, not theoretical benchmarks. 96GB of VRAM for under $5K is compelling if your use case maps to what these cards do well. Otherwise, you're just dealing with software quirks for no particular reason.
— Tyler Nakamura
Watch the Original Video
Intel's Battle Matrix Benchmarks and Review
Level1Techs, 20m 28s