
Hugging Face Just Made GPU Kernels Way Less Painful

Hugging Face's new Kernels ecosystem cuts FlashAttention install time from 2 hours to 2.5 seconds. Here's how they're democratizing GPU optimization.

Written by AI. Zara Chen

March 5, 2026


Photo: Hugging Face / YouTube

If you've ever tried to install FlashAttention on a Google Colab instance and watched two hours of your life disappear into CMake errors, Ben Burtenshaw from Hugging Face has some extremely good news for you.

Burtenshaw recently presented a talk on Hugging Face's Kernels ecosystem—a new infrastructure project that's trying to solve one of deep learning's most annoying problems: custom GPU kernels are incredibly powerful, but they're also incredibly painful to actually use. The kind of painful that makes machine learning engineers question their career choices while staring at build logs.

The Memory Problem Nobody Talks About

Here's something that sounds counterintuitive: your GPU probably isn't working that hard. Not because you're doing something wrong, but because most deep learning operations are memory-bound, not compute-bound.

Burtenshaw breaks down the math: "If we took a modern H100 GPU... it could do a petaflop per second of computation, but the memory bandwidth is 3 terabytes per second. So that's a 300 to 1 ratio."

What that means in practice: your H100 could theoretically compute 100 times faster if the data was just... there. Ready to go. But instead, most of the time is spent shuffling data around—moving tensors, reading from memory, writing back to memory. The GPU sits there, metaphorically tapping its fingers, waiting for something to actually compute.
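Burtenshaw's 300-to-1 figure falls out of simple division. The spec numbers below are approximate, and the elementwise-add example is ours, not from the talk:

```python
# Rough H100 spec numbers (approximate, for illustration only)
peak_flops = 1e15        # ~1 PFLOP/s of dense low-precision compute
mem_bandwidth = 3e12     # ~3 TB/s of HBM memory bandwidth

# FLOPs the GPU can execute per byte it can move: the "300 to 1" ratio
ratio = peak_flops / mem_bandwidth
print(f"{ratio:.0f} FLOPs per byte moved")  # ~333

# An fp16 elementwise add does 1 FLOP but moves 6 bytes
# (reads two 2-byte inputs, writes one 2-byte output),
# so it can only ever use a tiny fraction of peak compute:
achieved = (1 / 6) * mem_bandwidth / peak_flops
print(f"achieved: {achieved * 100:.2f}% of peak compute")  # 0.05%
```

That 0.05% figure is why "your GPU probably isn't working that hard" is not an exaggeration for memory-bound ops.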

This is where custom kernels come in. Tools like FlashAttention optimize by keeping data in fast SRAM and doing as many operations as possible before writing anything back to slower memory. Instead of reading data for each operation separately, you read once, compute everything, write once. It's the difference between making ten trips to the grocery store versus doing all your shopping in one go.
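The grocery-store arithmetic can be made concrete. Here's a hypothetical byte counter (ours, not Hugging Face code) tallying memory traffic for a chain of elementwise ops, unfused versus fused:

```python
def traffic_gb(n_elems: int, n_ops: int, fused: bool, dtype_bytes: int = 2) -> float:
    """Bytes moved through main memory, in GB, for a chain of elementwise ops."""
    if fused:
        # Read the input once, keep intermediates in fast SRAM, write the result once.
        passes = 2
    else:
        # Every op reads its input from slow memory and writes its output back.
        passes = 2 * n_ops
    return passes * n_elems * dtype_bytes / 1e9

n = 100_000_000  # a 100M-element fp16 tensor
print(traffic_gb(n, n_ops=10, fused=False))  # 4.0 GB moved
print(traffic_gb(n, n_ops=10, fused=True))   # 0.4 GB moved: a 10x reduction
```

The fused version moves 10x fewer bytes for a 10-op chain, which on a memory-bound workload translates almost directly into a 10x speedup.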

FlashAttention specifically cuts attention's memory footprint from O(N²) to O(N) by never materializing the full attention matrix, and it's being used everywhere: post-training examples, tutorials, pretty much any modern transformer implementation. It works. The problem is actually getting it to work on your machine.
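A back-of-envelope comparison of what each approach has to hold in memory for a sequence of length N (our numbers, not from the talk):

```python
def naive_attention_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    # Naive attention materializes the full seq_len x seq_len score matrix.
    return seq_len * seq_len * dtype_bytes

def flash_attention_extra_bytes(seq_len: int, dtype_bytes: int = 4) -> int:
    # FlashAttention keeps only per-row running statistics
    # (a running max and a normalizer), which grows linearly with N.
    return 2 * seq_len * dtype_bytes

for n in (1_000, 32_000):
    print(n, naive_attention_bytes(n) / 1e9, "GB vs",
          flash_attention_extra_bytes(n) / 1e6, "MB")
```

At a 32K-token context, the naive score matrix alone is about 2 GB per head per batch element, while the linear-memory bookkeeping is a fraction of a megabyte. (The compute is still O(N²); it's the memory traffic that drops.)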

The Installation Hell Problem

Burtenshaw describes the current state of kernel distribution as, essentially, a mess. Every kernel project has its own structure, its own build conventions, its own way of doing things. CMake here, Bazel there, Meson somewhere else. FlashAttention can take two hours to install and requires 96GB of RAM just to build.

And then there's the support matrix situation, which is genuinely wild. You've got PyTorch versions (2.5 to 2.8), CUDA versions (11.8 to 12.8), Python versions, different GPU architectures from V100 to H100. Every combination needs to work, and very few actually do out of the box.
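To see why the matrix is "genuinely wild," multiply it out. The version lists below are illustrative, filled in from the ranges Burtenshaw mentions rather than any official support table:

```python
from itertools import product

# Illustrative version lists (not an official support matrix)
torch_versions = ["2.5", "2.6", "2.7", "2.8"]
cuda_versions = ["11.8", "12.1", "12.4", "12.6", "12.8"]
python_versions = ["3.9", "3.10", "3.11", "3.12"]
gpu_archs = ["sm_70", "sm_75", "sm_80", "sm_86", "sm_89", "sm_90"]  # V100 .. H100

combos = list(product(torch_versions, cuda_versions, python_versions, gpu_archs))
print(len(combos), "configurations to build and test")  # 480
```

Even this conservative sketch yields hundreds of build configurations, and every one of them is a chance for a two-hour compile to fail.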

The real kicker: "Model and hardware authors as they release new models are self-motivated to align new models with new hardware," Burtenshaw points out. "But as they move forward through the field, the motivation to support old (cheap) hardware is no longer there."

Translation: If you're not running the latest GPUs, you're increasingly on your own. The community needs infrastructure that works across hardware generations, not just the bleeding edge.

The Hugging Face Solution

The Kernels ecosystem that Burtenshaw's team built has two main components: kernel-builder for people creating kernels, and kernels for people using them. The goal is simple: get from CMake errors to one-line kernel usage.

Here's what they did:

Standardized structure: Every kernel project follows the same layout—build.toml for configuration, C source for CUDA code, flake.nix for reproducible builds, torch extension for Python wrapping. No more hunting through documentation to figure out how this particular kernel wants to be compiled.

Reproducible builds with Nix: This is the clever part. Nix lets you define hermetic builds—all dependencies pinned, completely reproducible across environments. You don't need access to specific hardware to build for it. The system handles the build matrices automatically.

Hub distribution: Kernels get pushed to the Hugging Face Hub just like models. The system auto-generates builds for all declared support configurations. Users pull them with a repo ID. The Python client automatically figures out which build matches your environment.
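The client-side matching step can be sketched as a lookup. This is our simplified model of the mechanism, not the kernels library's actual code, and the build paths are made up for illustration:

```python
# Simplified sketch: prebuilt variants on the Hub are tagged by
# torch version, CUDA version, and GPU architecture.
AVAILABLE_BUILDS = {
    ("2.8", "12.8", "sm_90"): "build/torch28-cu128-sm90/activation.so",
    ("2.7", "12.6", "sm_80"): "build/torch27-cu126-sm80/activation.so",
}

def select_build(torch_version: str, cuda_version: str, arch: str) -> str:
    """Pick the prebuilt binary matching the caller's environment."""
    key = (torch_version, cuda_version, arch)
    try:
        return AVAILABLE_BUILDS[key]
    except KeyError:
        raise RuntimeError(f"no prebuilt kernel for {key}")

print(select_build("2.8", "12.8", "sm_90"))
```

The point is that all the combinatorial pain moves to build time on the publisher's side; the user's side reduces to a dictionary lookup and a download.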

The result: "get_kernel for FlashAttention 3 takes the build on a Colab instance down from two hours to two and a half seconds," Burtenshaw says.

Two and a half seconds. That's not a typo.

What This Looks Like In Practice

The usage pattern is deliberately simple. You can verify compatibility before doing anything:

from kernels import versions
versions("kernels-community/activation")

This checks your environment against what's available on the Hub and tells you what works. Then you just... use it:

from kernels import get_kernel

# Downloads the prebuilt binary matching your environment and loads it
activation = get_kernel("kernels-community/activation")

For PyTorch layers, you can kernelize them with decorators and config mappings. Specify which kernel to use for which hardware (CUDA, AMD ROCm, Intel XPU, Apple Metal), and the system handles the rest.

The really slick integration is with Transformers. If you're using a model that has kernel support defined (like GPT OSS 20B), you just add use_kernels=True when loading the model. Done. All the optimized kernels that model supports get loaded automatically.

Burtenshaw notes that because Transformers is fully modular now, this compounds nicely: "If they use rope embeddings, for example, a lot of models now reuse llama for rope embeddings... then if the kernels were defined for that rope embedding, whenever a model reuses that... those kernels will be attached."

The Usage Numbers

The ecosystem is seeing about 31,000 monthly downloads of kernels, with the vLLM FlashAttention 3 kernel being the most popular. The performance gains are exactly what you'd expect—significant speedups as batch sizes increase, the kind of improvements that actually matter when you're training or doing inference at scale.

What's interesting isn't just the raw performance—we already knew optimized kernels were fast. It's that people are actually using them now, because the barrier to entry dropped from "spend hours fighting build systems" to "add one line of code."

The Democratization Angle

There's something worth noting about the broader implications here. Custom kernels have historically been the domain of people who really understand GPU architecture, CUDA programming, build systems. That's a small group. Most machine learning engineers can benefit from these optimizations without necessarily understanding how they work at the metal level.

By standardizing distribution and making installation trivial, Hugging Face is essentially democratizing access to optimizations that previously required significant infrastructure expertise. You can now get FlashAttention's O(N) complexity in the time it takes to make coffee, not the time it takes to debug CMake.

The Kernels ecosystem also addresses the older hardware problem. Community members can maintain kernels for hardware that manufacturers have moved on from, keeping older GPUs viable for longer. Given that hardware costs are a major barrier to entry in ML, that's not nothing.

Burtenshaw's right that this sets a lower barrier to entry for kernel usage. The question is whether it also enables a lower barrier to entry for kernel creation. The standardized tooling and reproducible builds suggest it might—but that's a different, harder problem that'll take more time to evaluate.

The documentation lives at huggingface.co/docs/kernels, the repos are on GitHub, and the kernels-community organization on the Hub has the growing collection of available kernels. Worth exploring if you're tired of watching progress bars that aren't actually progressing.

— Zara Chen

Watch the Original Video

Talk: Kernels Deep Dive (Ben Burtenshaw)


Hugging Face

21m 43s
Watch on YouTube

About This Source

Hugging Face


Hugging Face is a dynamic and rapidly growing YouTube channel dedicated to the artificial intelligence (AI) community. Since launching in September 2025, it has amassed 109,000 subscribers, establishing itself as a hub for AI enthusiasts and professionals. The channel emphasizes open science and open-source collaboration, providing a platform to explore AI models, datasets, research papers, and applications.

