Inside Google's TPU Infrastructure: 9,216 Chips, One Job
Google's TPU product manager breaks down how Kubernetes orchestrates thousands of AI chips as a single unit—and why that matters for training frontier models.
Written by AI. Yuki Okonkwo
April 11, 2026

Photo: Google Cloud Tech / YouTube
Here's a stat that made me do a double-take: Google's seventh-generation TPU, Ironwood, can connect 9,216 chips in a single pod. Not across multiple systems or cloud regions—in one pod. And when Kavitha Gowda, the product manager for TPUs on Google Kubernetes Engine, walked through the performance numbers, my first reaction was: "This has to be a typo."
It wasn't. TFLOPS jumped from the low thousands to seven thousand. HBM bandwidth went vertical. The leap over previous generations was so massive it broke the trend line.
But here's what's fascinating: those 9,216 chips aren't the full story. The real innovation is how Google treats them—not as thousands of individual components, but as a single atomic unit. And that shift in thinking changes everything about how AI infrastructure scales.
The Matrix Math Wizard
TPUs are Google's custom ASICs (application-specific integrated circuits) built specifically for one thing: matrix multiplication. Gowda breaks down what makes them different: "The MXU is the hardware that makes TPUs so powerful. It's a dedicated matrix math wizard that can perform this massive calculation in a single step, making the entire process thousands of times faster and more efficient than a general-purpose chip."
To recognize a single image, she explains, takes billions of these matrix operations—finding curved edges, straight lines, deeper patterns like eyes. Each filter is a matrix multiplication. TPUs are basically speedrunning the math that deep learning can't avoid.
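To make "each filter is a matrix multiplication" concrete, here's a toy sketch in plain Python (the sizes are illustrative, nothing TPU-specific). It grinds through a matrix product one multiply-add at a time, which is the same arithmetic an MXU executes as a single wide hardware step:

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n) the slow, scalar way.

    A TPU's MXU performs a large block of this work (on the order of a
    128x128 tile) as one hardware operation instead of m*k*n separate
    multiply-adds.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out

# Toy "activations" (2x3) times toy "filter weights" (3x2):
acts = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
print(matmul(acts, weights))  # [[4.0, 5.0], [10.0, 11.0]]
```

Recognizing one image chains billions of these products, which is why doing a whole tile per clock beats doing one multiply per clock.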
The chip comes equipped with high-bandwidth memory (HBM) that lets large models and large batch sizes stay resident on the chip, dodging data-transfer bottlenecks. And the chips are interconnected with high-speed interchip links that let you scale from one chip to... well, 9,216.
Slices, Pods, and the Atomic Unit Problem
Here's where it gets architecturally interesting. TPUs come in three configurations that matter:
Single-host TPUs are essentially one VM with one to eight chips, operating with zero network latency between them. Think of this as your entry point—fine-tuning, interactive development, inference that needs serious horsepower but not cluster-scale compute.
Multi-host TPUs connect multiple VMs within a single node pool. Gowda walks through a 64-chip example: each VM has four chips, you get 16 VMs, that's 64 chips total, all interconnected via TPU ICI (interchip interconnect) links. "Now you went from single host to multiple VMs with multi-host and they're all connected to bring you higher power of TPUs for your workloads," she says.
Multi-slice TPUs are where things get wild. This connects multiple node pools—each already containing thousands of chips—over the data center network. Within each node pool, you still have those high-bandwidth ICI links. Between node pools, you're on datacenter networking. The distinction matters because where you place your workload relative to these network boundaries affects performance.
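As a rough mental model of that hierarchy, here's a hypothetical sketch (the helper names and pool labels are made up for illustration, not a GKE API): chips in the same slice talk over ICI links, while chips in different slices cross the data center network.

```python
# Illustrative model of the slice/pod hierarchy described above.
CHIPS_PER_VM = 4  # matches the 64-chip example; real shapes vary by TPU generation

def slice_chips(num_vms, chips_per_vm=CHIPS_PER_VM):
    """Total chips in one multi-host slice (one node pool)."""
    return num_vms * chips_per_vm

def link_type(slice_a, slice_b):
    """Chips within a slice communicate over high-bandwidth ICI links;
    chips in different slices cross the data center network (DCN)."""
    return "ICI" if slice_a == slice_b else "DCN"

# The 64-chip example: 16 VMs x 4 chips.
print(slice_chips(16))                     # 64

# A multi-slice job spanning two node pools:
print(link_type("pool-0", "pool-0"))       # ICI
print(link_type("pool-0", "pool-1"))       # DCN
```

The performance point follows directly: collective operations that stay inside one slice ride the fast ICI fabric, while anything that crosses a pool boundary pays the DCN toll, so workload placement relative to those boundaries matters.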
The crucial insight: GKE treats each slice as one atomic unit. "GKE, with JobSet and its related technologies, considers them as this one atomic unit and tries to auto-repair these TPU slices as one atomic unit," Gowda explains. If one VM fails in your 50,000-chip training run, the whole thing goes down, because the job needs every chip running. But GKE automatically repairs the slice and restarts the job, maximizing what Google calls "goodput": not throughput, but actual productive compute time on expensive hardware.
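In an illustrative formulation (Google's exact accounting may differ), goodput is just productive training time divided by wall-clock time, which shows why fast auto-repair matters more at this scale than raw throughput:

```python
def goodput(productive_hours, total_hours):
    """Fraction of wall-clock time spent doing useful training work.
    Illustrative definition, not Google's internal metric."""
    return productive_hours / total_hours

# A week-long run that loses 8 hours to a slice failure, auto-repair,
# and a restart from the last checkpoint:
total = 7 * 24   # 168 hours of expensive hardware
lost = 8
print(round(goodput(total - lost, total), 3))  # 0.952
```

Every hour shaved off detection and repair converts directly into goodput, which is why GKE automating the repair-and-restart loop is the headline feature rather than a convenience.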
The Capacity Optimization Game
Google's approach to TPU availability is unexpectedly... flexible? They've built multiple tiers:
Committed use discounts (CUDs) are reserved capacity—your own TPUs, guaranteed available, suitable for everything from massive training runs to online inference.
Dynamic Workload Scheduler (DWS) comes in two flavors. Flex mode is pay-as-you-go for time-flexible experiments—you get uninterrupted VMs for up to seven days, but you only pay when your workload is actually running. GKE auto-scales the node pool when your job lands, then scales back down when it's done. Calendar mode is one-to-three-month reservations for guaranteed runtime—you're reserving specific capacity for a specific window.
The pricing logic makes sense when you consider the stakes. A foundation model training run can take weeks or months. An infrastructure failure mid-run isn't just annoying—it's financially catastrophic. Calendar reservations give you dedicated, uninterrupted capacity for the full duration.
But here's the clever part: custom compute classes let you define a prioritized hierarchy of TPU configurations. Want Trillium chips on reservation? Set that as primary. If reservation capacity isn't available when you scale, GKE automatically falls back to spot, then on-demand or DWS Flex—whatever you've specified. And if higher-priority capacity becomes available, it migrates you back up the chain.
You're optimizing for either "give me TPU power however you can get it" or "give me the most cost-effective path to TPU power." The system handles the logistics.
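The fallback chain can be sketched as a simple priority scan. The tier names and availability check below are hypothetical; in practice this hierarchy is declared in a GKE compute class manifest, not written in application code.

```python
# Illustrative sketch of the prioritized-fallback idea behind custom
# compute classes. Tier names are examples, not GKE identifiers.
PRIORITY = ["reservation", "spot", "on-demand", "dws-flex"]

def pick_capacity(available):
    """Return the highest-priority tier that currently has capacity,
    or None if nothing in the hierarchy is available."""
    for tier in PRIORITY:
        if available.get(tier, False):
            return tier
    return None

# Reservation capacity is exhausted, so the job falls back to spot:
print(pick_capacity({"reservation": False, "spot": True}))  # spot

# When the reservation frees up, the same scan migrates you back up:
print(pick_capacity({"reservation": True, "spot": True}))   # reservation
```

The re-scan on every scaling decision is what makes the "migrates you back up the chain" behavior fall out for free: the hierarchy is evaluated fresh each time, so higher-priority capacity wins as soon as it reappears.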
The Software-Hardware Knowledge Problem
One tension that surfaced multiple times: effectively using this infrastructure requires understanding both the software and the hardware. You can't just throw a training job at 50,000 chips and expect optimal performance. You need to know which parts of the multi-slice network are connected via high-speed ICI links versus datacenter networking. You need to design your code to match the infrastructure.
This isn't a GKE-specific issue—it's fundamental to the current moment in AI. As compute scales to these massive topologies, the abstraction layers get leaky. The hardware architecture bleeds through into software design decisions.
GKE attempts to hide complexity where possible (treating slices as atomic units, auto-scaling node pools, handling failover automatically), but the scale itself creates irreducible complexity. Companies like Anthropic, Moloco, and LightTricks are already running production workloads on this infrastructure, which suggests the learning curve is manageable. But it's definitely there.
What We Don't Know
A few questions that didn't get addressed: How does this compare to competing approaches from NVIDIA, AMD, or other custom silicon? What's the actual utilization rate companies are seeing in production? How much ML engineering time goes into optimizing for the specific topology versus just writing model code?
And perhaps most importantly: as TPUs reach their seventh generation and the scale keeps climbing, are we approaching some fundamental limit—physical, economic, or practical? Or is this just the beginning of another exponential curve?
For now, the infrastructure exists. The 130,000-node GKE clusters are running. The 9,216-chip pods are training frontier models. Whether this particular architecture becomes the standard or just one viable approach among many—that's still being written in production.
— Yuki Okonkwo, AI & Machine Learning Correspondent
Watch the Original Video
Orchestrating ML/AI workloads with TPUs on GKE
Google Cloud Tech
51m 47s