
Intel Arc Pro B60: Testing 96GB of AI VRAM for $5K

Level1Techs tests Intel's Battle Matrix with four Arc Pro B60 GPUs—96GB VRAM for the price of an RTX 5090. Real-world AI performance examined.

Written by AI. Tyler Nakamura

February 27, 2026


Photo: Level1Techs / YouTube

Here's the value proposition that made Level1Techs curious enough to build a whole test rig: four Intel Arc Pro B60 GPUs with 24GB of VRAM each cost roughly the same as a single 32GB RTX 5090. That's 96GB of total VRAM versus 32GB. The math is simple. The execution? That's where things get interesting.

The Hardware Reality

The Arc Pro B60 cards aren't trying to beat Nvidia's flagship data center GPUs. They're playing a different game entirely. Each card pulls 150-200 watts through a standard 8-pin connector—you can run four of them in a regular desktop PC on a normal North American household outlet. No special electrical work required. No enterprise cooling solutions. Just a Xeon workstation with enough PCIe lanes and you're done.
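The "normal household outlet" claim checks out on paper. A quick sketch of the budget, assuming a 15 A North American circuit and a rough guess at platform overhead (the overhead figure is an assumption, not from the video):

```python
# Rough power budget for the four-GPU build. GPU numbers are from
# the article; the platform overhead figure is an assumption.
GPU_MAX_W = 200        # worst-case draw per B60
N_GPUS = 4
PLATFORM_W = 400       # assumed: Xeon, RAM, drives, fans

# NEC derates a branch circuit to 80% for continuous loads.
CIRCUIT_W = 120 * 15 * 0.8

total_w = N_GPUS * GPU_MAX_W + PLATFORM_W
print(f"{total_w} W draw vs {CIRCUIT_W:.0f} W continuous budget")
```

Even at worst-case GPU draw, the box stays comfortably inside a single 15 A circuit's continuous rating.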

"These four GPUs have 96 gigs of VRAM, as much as an Nvidia RTX Pro 6000, which costs $8,500," Level1Techs points out in their testing video. "But the four of these GPUs cost about the same as a 32 gig RTX 5090."

That's the pitch: VRAM per dollar for local AI inference. Not maximum performance. Not data center scale. Just enough power to run large language models locally without cloud subscription costs.

Each B60 packs 20 Xe cores, 160 XMX engines (Intel's matrix-multiplication hardware), and 456 GB/s of memory bandwidth. The PCIe Gen 5 x8 interface matters for some workloads: data-parallel jobs run fine, but tensor-parallel operations that need constant GPU-to-GPU communication may hit bandwidth limits before they hit compute limits. Your workload determines whether that's a problem.
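That 456 GB/s figure sets a hard ceiling on single-stream decode speed, because each generated token has to stream the active weights out of VRAM at least once. A back-of-the-envelope sketch (the model footprint is illustrative, not a benchmark):

```python
# Rough single-stream decode ceiling for a bandwidth-bound GPU:
# tokens/s <= memory bandwidth / bytes of weights read per token.
# The 15 GB per-card figure is illustrative: a 60 GB model split
# evenly across four cards streaming in parallel.
B60_BANDWIDTH_GBS = 456.0  # per card, from the article

def decode_ceiling_tok_s(weights_gb_per_card,
                         bandwidth_gbs=B60_BANDWIDTH_GBS):
    return bandwidth_gbs / weights_gb_per_card

print(round(decode_ceiling_tok_s(60 / 4), 1))  # -> 30.4 tok/s per stream
```

That's a per-stream ceiling; batched serving (and sparse mixture-of-experts models that activate only a fraction of their weights per token) pushes aggregate throughput far higher, which is how the headline numbers below get into the hundreds.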

The Software Situation

Intel's betting on a "validated stack"—Ubuntu 25.04, specific kernel versions, and their own fork of vLLM (the popular LLM serving framework). This is where things get complicated. Mainline vLLM was at version 0.15 when Level1Techs ran their tests. Intel's stack was at 0.11.1. That gap represents features, optimizations, and bug fixes that aren't available yet on Intel's platform.
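Whichever vLLM branch you run, the client side looks the same, since both expose the OpenAI-compatible HTTP API. A minimal sketch; the host, port, and model name are placeholders for your own deployment:

```python
# Minimal sketch of talking to a vLLM server over its
# OpenAI-compatible HTTP API. Host, port, and model name are
# placeholders; vLLM serves /v1/chat/completions by default.
import json
import urllib.request

def build_chat_payload(model, prompt, max_tokens=128):
    # Standard OpenAI-style chat-completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1",
         model="openai/gpt-oss-120b"):
    data = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the API surface is identical, code like this doesn't care whether it's pointed at Intel's 0.11.1 fork or mainline 0.15; the version gap shows up in server-side features and performance, not in the client.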

"Intel is maintaining cadence, but there's still a gap versus mainline vLLM," the Level1Techs team notes. "Whether that gap matters depends on whether you need the latest features right now, or you want something stable that's going to ship with known good drivers and kernels."

This is classic vendor-branch-versus-mainline tension. Intel wants stability and validation. The community wants bleeding-edge features. You can't have both simultaneously.

The testing revealed which models work great and which ones... don't. MXFP4 quantization is where these cards shine—Level1Techs ran a 120 billion parameter model requiring 60GB of VRAM and got solid performance. Nearly 1,000 tokens per second on GPT-OSS 120B with 3.9 seconds to first token. That's legitimately good for a sub-$5,000 setup.

Smaller models in the 8-20 billion parameter range? "All day long. All day long, it's going to work pretty well on this platform," according to the testing.

But Llama 70B threw errors initially. The FP16 weights need 140GB of VRAM, more than this setup provides, so dynamic quantization is required. Once Level1Techs figured out the right block-size settings and chat-template configuration, it worked, though some responses came back "a little bit incoherent." The best case yielded 366 tokens per second of output with 12.9 seconds to first token. Usable, not amazing.
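The arithmetic behind that 140GB figure is just parameters times bits per weight, which also shows why quantization rescues the model (this sketch ignores KV-cache and activation overhead, which eat into the headroom):

```python
# Weight footprint = parameters x bits-per-weight. Overheads like
# the KV cache are ignored here, so real usage runs higher.
def weight_footprint_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8

print(weight_footprint_gb(70, 16))  # FP16: 140.0 GB -- exceeds 96 GB
print(weight_footprint_gb(70, 8))   # INT8: 70.0 GB -- fits
print(weight_footprint_gb(70, 4))   # 4-bit: 35.0 GB -- plenty of headroom
```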

Where Performance Actually Lands

Qwen3 30B: 991 tokens per second throughput. The DeepSeek R1 distill of Llama 70B: 178 tokens per second, with better coherence than standard Llama 70B. These numbers show the system handles large models competently when the software path is optimized.

The concerning part? Time to first token in agentic coding scenarios. When an editor loads 10,000 tokens of context, prefill performance becomes the bottleneck. "That is when the prefill performance will bite you a little bit," Level1Techs warns. For interactive use with short prompts, first token latency is 1-2 seconds. For context-heavy workflows, it's noticeably slower.

During testing, the team encountered GPU kernel crashes. That's not unique to Intel, as they note similar issues with Blackwell GPUs, but it's still something you need monitoring scripts to catch. They put together a diagnostic tool that emails admins when a GPU drops, which is pragmatic, if slightly concerning that it's necessary at all.
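Level1Techs doesn't publish their script, but a watchdog in that spirit is short. A hedged sketch, assuming the cards appear in `lspci` output and a local MTA is available; the match string and email addresses are placeholders:

```python
# Hedged sketch of a GPU watchdog: count devices visible to lspci
# and mail an admin when one drops off the bus. The device match
# string, addresses, and SMTP host are assumptions.
import smtplib
import subprocess
import time
from email.message import EmailMessage

EXPECTED = 4

def count_matches(lspci_output, pattern="Display controller"):
    # Pure helper so the parsing logic is testable without hardware.
    return sum(1 for line in lspci_output.splitlines() if pattern in line)

def count_gpus():
    out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    return count_matches(out)

def alert(n_missing):
    msg = EmailMessage()
    msg["Subject"] = f"GPU watchdog: {n_missing} GPU(s) missing"
    msg["From"] = "watchdog@example.com"  # placeholder address
    msg["To"] = "admin@example.com"       # placeholder address
    msg.set_content("A GPU dropped off the PCIe bus; check dmesg.")
    with smtplib.SMTP("localhost") as s:  # assumes a local MTA
        s.send_message(msg)

def watch(interval_s=60):
    while True:
        seen = count_gpus()
        if seen < EXPECTED:
            alert(EXPECTED - seen)
        time.sleep(interval_s)
```

Run it under systemd or cron so the watchdog itself survives reboots; polling `lspci` catches a card that has fallen off the bus even when the driver stack is too wedged to report it.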

The Unfinished Software Story

ComfyUI works through Intel's LLM Scaler Omni container, but only with a subset of the expected models. Some chunked-prefill settings don't actually turn off when disabled, causing crashes. These feel like container bugs rather than fundamental architecture problems, but they're friction nonetheless.

"Intel's pitch will fail if the software part of it fails," Level1Techs states plainly. "And historically, Intel has been a software juggernaut, but 2026 is moving fast."

The oneAPI approach, Intel's attempt at making GPU programming accessible across vendors, gets positive marks for DIY scriptability. Being able to automate and deploy with Python matters for people building actual applications rather than just running benchmarks.

What This Actually Means

Intel is "uncharacteristically aggressive" with pricing here. They're not competing on specs. They're competing on making local AI inference accessible to people who can't afford $8,500 enterprise GPUs or don't want perpetual cloud subscription costs.

The Arc Pro B60 platform works best when:

  • Your models fit in 96GB and use INT8/MXFP4 quantization
  • You're running inference, not fine-tuning or training
  • You need data parallel rather than tensor parallel workloads
  • You value privacy and local execution over maximum performance
  • You're okay with slightly-behind-mainline software

It struggles when:

  • You need sub-second time-to-first-token with large context
  • Your workflow requires the absolute latest vLLM features
  • You're doing tensor parallel operations with heavy cross-GPU communication
  • You want plug-and-play experience without troubleshooting

Level1Techs is continuing testing and asking their community what workloads to try next. That's the right move—this hardware's place in the market depends entirely on what people actually need to run, not theoretical benchmarks. 96GB of VRAM for under $5K is compelling if your use case maps to what these cards do well. Otherwise, you're just dealing with software quirks for no particular reason.

— Tyler Nakamura

Watch the Original Video

Intel's Battle Matrix Benchmarks and Review


Level1Techs

20m 28s
Watch on YouTube

About This Source

Level1Techs

Level1Techs is a rapidly growing YouTube channel that has established itself as a key player in the tech community since its launch in 2025. With over 512,000 subscribers, the channel provides in-depth analysis and discussions on technology, science, and design, aiming to educate and engage a technologically-inclined audience.

