Decoding the Fastest Machines for Token Generation
Exploring how fast different machines generate 1M tokens, and how much energy they use doing it.
Written by AI. Dev Kapoor
January 2, 2026

Photo: Alex Ziskind / YouTube
When Alex Ziskind decided to pit a variety of machines against each other in a race to generate one million tokens, the results were a fascinating blend of expected outcomes and surprising revelations. This is not only a story about speed. It is also about how hardware capabilities, software tools, and energy consumption interact in a modern inference workload.
The Contenders
Ziskind's lineup mixed budget and high-end hardware, each with its own strengths and weaknesses. On the budget end, the AMD Radeon RX 9060 XT stood out, while the high-end arena was dominated by the DGX Spark, the Beelink GTR9, and the Mac Studio with its formidable M3 Ultra.
"I wanted to answer one simple question. Which of these machines can generate 1 million tokens the fastest?" Ziskind states, setting the stage for this technological showdown.
Benchmarking and Fair Play
To ensure a level playing field, all machines were tested with the same 4-billion-parameter model, Qwen3 4B. This let the experiment focus on each machine's raw ability to process and generate tokens.
Ziskind explains, "Some things need to be equal, right? They need to all fit inside the smallest common denominator, which is this 16 GB GPU."
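To see why a 4-billion-parameter model fits comfortably on a 16 GB card, here is a rough back-of-the-envelope sketch. It assumes exactly 4 billion parameters and counts only the weights; the KV cache, activations, and runtime overhead are ignored, so real usage will be higher. The function name is illustrative and not taken from the video.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / (1024 ** 3)


# Assumption: "4B" is treated as exactly 4 billion parameters.
NUM_PARAMS = 4e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_footprint_gib(NUM_PARAMS, bits)
    print(f"{label:>6}: {gib:.2f} GiB of weights")
```

Under these assumptions the weights take roughly 7.5 GiB at 16-bit precision and about half that at 8-bit, which leaves headroom on a 16 GB GPU for the cache and batching.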
The Role of Software and Quantization
It's not just about hardware muscle, though. The choice of software library, such as llama.cpp, vLLM, or MLX, plays a significant role in performance. Each library is optimized for specific hardware, which affects both speed and energy efficiency. The decision to use quantized models, such as 8-bit versions, also dramatically influences throughput.
Ziskind notes, "You’re probably going to run something quantized like an 8-bit version or FP8. And the difference between FP8, which is floating point 8, and integer 8 is something I talked about in other videos."
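For readers unfamiliar with the integer side of that distinction, the sketch below shows one simple form of 8-bit integer quantization: symmetric, with a single scale for the whole list of values. This is an illustration only and an assumption on our part; it is not how llama.cpp, vLLM, or MLX implement quantization, and those libraries typically use finer-grained scales. FP8 differs in that it keeps a small floating-point exponent rather than mapping values onto evenly spaced integers.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each value is stored in one byte instead of two or four, at the cost of a rounding error of at most half the scale per value.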
Power Consumption: The Hidden Cost
Beyond speed, Ziskind pays close attention to the power consumption of each setup. The Mac Studio, for instance, was impressively efficient at idle, drawing as little as 9 watts. In contrast, the DGX Spark, though powerful, consumed significantly more power when idle.
"This is my custom bench with the Radeon machine. 53 watts being used by that box, total. That’s without running anything," he explains, highlighting the importance of considering energy use in performance evaluations.
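To put those idle figures in perspective, the following sketch converts a constant idle draw into energy used over a day. The 9 W and 53 W values come from the article; the 24-hour window and the function name are assumptions chosen for illustration, and real idle draw fluctuates.

```python
def idle_energy_kwh(idle_watts: float, hours: float) -> float:
    """Energy in kWh consumed by a machine idling at a constant wattage."""
    return idle_watts * hours / 1000


HOURS = 24  # assumption: the machine sits idle for a full day

for name, watts in [("Mac Studio", 9), ("Radeon bench", 53)]:
    kwh = idle_energy_kwh(watts, HOURS)
    print(f"{name}: {kwh:.3f} kWh per {HOURS} h idle")
```

Multiplying the result by a local electricity price gives the daily cost of leaving each machine switched on.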
Results and Reflections
As the race unfolded, the DGX Spark emerged as the fastest, completing the task in just 6.7 minutes with a throughput of 2,451 tokens per second. Despite its speed, the Spark’s high idle power consumption suggests that its efficiency shines only under load.
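The relationship between throughput and total time is simple division, sketched below with the reported throughput. The function name is illustrative. The result lands close to the reported 6.7 minutes; the small gap is plausibly down to rounding or to how the run was timed, which the article does not specify.

```python
def seconds_to_generate(total_tokens: int, tokens_per_second: float) -> float:
    """Time needed to emit total_tokens at a steady throughput."""
    return total_tokens / tokens_per_second


TOTAL_TOKENS = 1_000_000
REPORTED_THROUGHPUT = 2451  # tokens per second, as reported in the article

seconds = seconds_to_generate(TOTAL_TOKENS, REPORTED_THROUGHPUT)
print(f"{seconds:.0f} s, or about {seconds / 60:.1f} minutes")
```

The same function can be used to compare any two machines: halving the throughput doubles the time to reach one million tokens.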
Surprisingly, the modest Radeon card performed admirably, proving that even budget options can hold their own in certain contexts. Meanwhile, the Mac Studio's blend of speed and energy efficiency makes it a compelling choice for many users.
Raw Speed Alone Won't Win the Inference Race
Ziskind's experiment underscores a crucial point: in machine learning and AI, performance has several dimensions. Raw speed matters, but energy consumption, software compatibility, and cost all weigh on which tool is best for the job.
As Ziskind reflects, "Fast rigs are great, but reliability comes from fundamentals. That’s why I keep sharpening the basics." It's a reminder that in the ever-evolving world of technology, understanding the full picture is essential to making informed choices.
By Dev Kapoor for Buzzrag.
Watch the Original Video
Fastest 1000000 tokens
Alex Ziskind
18m 29s
About This Source
Alex Ziskind
Alex Ziskind is a seasoned software developer turned content creator, captivating an audience of over 425,000 subscribers with his tech-savvy insights and humor-infused reviews. With more than 20 years in the coding realm, Alex's YouTube channel serves as a digital playground for developers eager to explore software enigmas and tech trends.