🐝Daily 1 Bite
📖 8 min read

Google TPU 8th Gen: Why Splitting Training (8t) and Inference (8i) Into Two Chips Changes Everything [Apr 2026]

Google's 8th-gen TPU splits into training-focused 8t and inference-focused 8i. 121 ExaFlops training, 80% inference cost savings — here's what developers need to know.

Tags: Google TPU, AI chip, TPU 8th gen, AI infrastructure, NVIDIA competition, cloud AI

TL;DR: Google unveiled 8th-generation TPUs at Google Cloud Next on April 22, 2026 — not one chip, but two: TPU 8t (training) and TPU 8i (inference). TPU 8t scales to 9,600 chips delivering 121 ExaFlops. TPU 8i cuts inference costs by up to 80% over Ironwood. Anthropic committed to multi-gigawatt usage. Both support JAX, PyTorch, and vLLM. GA expected H2 2026.


"Can't one chip handle both training and inference?"

That's what I used to think. Spin up a GPU, fine-tune your model, then use the same GPU to serve it. Simple enough, right?

Then Google announced two entirely separate TPU architectures for its 8th generation — one for training, one for inference — and I realized how much I'd been glossing over the fundamental differences between these two workloads.

[Image: Google Cloud Next 2026 keynote, where TPU 8th gen was unveiled. Source: Google Cloud Blog]

Why Training and Inference Actually Need Different Hardware

Let me break this down because it genuinely matters.

Training is like climbing a mountain. You're updating hundreds of billions of parameters, iterating over massive datasets, and coordinating synchronized updates across thousands of chips. What matters most: raw compute throughput and inter-chip communication bandwidth. The chip needs to crunch numbers fast and talk to its neighbors even faster.

Inference is the descent. You have a finished model; you need to answer questions as fast and cheaply as possible. The bottleneck here isn't raw compute — it's memory bandwidth. How quickly can you read the model weights from memory? How efficiently can you manage the KV cache for long-context agentic workflows?

These two workloads have fundamentally different hardware requirements. Training wants massive compute density and scale-up bandwidth. Inference wants enormous memory bandwidth and low latency. Every single GPU generation until now has tried to balance these competing needs in one chip. With TPU 8th gen, Google said: stop compromising.
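A quick back-of-the-envelope sketch makes the inference bottleneck concrete: during decode, every weight of a dense model must be read from HBM once per generated token, so memory bandwidth alone sets a latency floor no amount of compute can beat. The model size and bandwidth figures below are hypothetical, purely for illustration:

```python
# Why inference is memory-bandwidth-bound: a lower bound on decode latency.
# All numbers here are hypothetical, not official specs for any chip.

def min_time_per_token(param_count, bytes_per_param, hbm_bandwidth_gbs):
    """Bandwidth floor for a dense model: one full weight read per token."""
    weight_bytes = param_count * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gbs * 1e9)  # seconds

# A 70B-parameter dense model with 8-bit weights on a chip with
# 3,000 GB/s of HBM bandwidth:
t = min_time_per_token(70e9, 1, 3000)
print(f"{t * 1e3:.1f} ms/token floor")  # ~23.3 ms, regardless of FLOPs
```

The FLOP count never enters the calculation, which is exactly why an inference chip trades compute density for memory bandwidth.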

TPU 8t: Built for Frontier Model Training

The 't' stands for Training. Here's what Google shipped:

[Image: TPU 8t and TPU 8i, the two-chip architecture. Source: Google Blog | Two chips for the agentic era]

| Spec | TPU 8t |
| --- | --- |
| Superpod scale | Up to 9,600 chips |
| Shared HBM | 2 petabytes |
| Peak performance | 121 ExaFlops |
| Storage access | 10x faster than Ironwood |
| Training price-performance | 2.7x over Ironwood |
| Linear scaling | Up to 1 million chips |

The architecture choices here are deliberate. TPU 8t uses a 3D Torus ICI topology paired with the new Virgo network, delivering 47 petabits/second of non-blocking bisection bandwidth across a 9,600-chip superpod. Google reports ~97% "goodput", meaning almost all compute time is spent doing productive work rather than waiting on communication.
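Goodput is just the fraction of wall-clock step time spent on useful compute rather than on exposed (non-overlapped) communication or stragglers. A minimal sketch with hypothetical step timings shows what the ~97% regime looks like:

```python
# Goodput: useful-compute fraction of a training step. Timings hypothetical.

def goodput(compute_time_s, comm_exposed_time_s):
    """Exposed communication is the part NOT overlapped with compute."""
    total = compute_time_s + comm_exposed_time_s
    return compute_time_s / total

# A step with 950 ms of compute and only 30 ms of exposed communication:
print(f"{goodput(0.950, 0.030):.1%}")  # ~96.9%, roughly the regime Google cites
```

Overlapping communication with compute raises the numerator without shrinking the step, which is why interconnect bandwidth matters so much at 9,600-chip scale.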

Native FP4 support is worth highlighting specifically. By training with 4-bit floating point, TPU 8t halves memory bandwidth requirements for many operations, which is especially beneficial for Mixture-of-Experts (MoE) models that have seen massive adoption in frontier training runs.
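The bandwidth saving from FP4 is straightforward to quantify. The sketch below uses a hypothetical MoE active-parameter count and ignores the per-block scaling-factor overhead that real FP4 formats carry:

```python
# FP4 vs FP8 weight-memory traffic. Active-parameter count is hypothetical;
# real low-bit formats add small scaling-factor overhead not modeled here.

def weight_traffic_gb(param_count, bits_per_param):
    return param_count * bits_per_param / 8 / 1e9

moe_active_params = 40e9  # hypothetical active params per token in an MoE
fp8 = weight_traffic_gb(moe_active_params, 8)
fp4 = weight_traffic_gb(moe_active_params, 4)
print(f"FP8: {fp8:.0f} GB read per token, FP4: {fp4:.0f} GB read per token")
```

Halving the bytes moved per token is what makes FP4 attractive precisely where MoE models are already bandwidth-hungry.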

To put the scale in context: Anthropic has already committed to multi-gigawatt TPU usage from Google — a deal I covered when the $19B ARR announcement dropped. TPU 8t is almost certainly what Anthropic will be training on.

TPU 8i: Built for the Inference-Heavy Agentic Era

The 'i' stands for Inference. The design philosophy here is "memory, more of it, faster":

[Image: TPU 8i technical architecture. Source: Google Cloud Blog | TPU 8i technical deep dive]

| Spec | TPU 8i |
| --- | --- |
| HBM capacity | 288 GB |
| On-chip SRAM (Vmem) | 384 MB (3x previous gen) |
| HBM bandwidth | 8,601 GB/s |
| ICI bandwidth | 19.2 Tb/s |
| Inference price-performance | Up to 80% improvement over Ironwood |
| Max pod size | 1,024 active chips |
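Plugging the published 288 GB / 8,601 GB/s figures into the same bandwidth-floor arithmetic gives a feel for what the chip buys at decode time. The model size below is a hypothetical example, not a Google benchmark:

```python
# Bandwidth-bound decode floor using published TPU 8i figures.
# The 200 GB model is a hypothetical example.

HBM_CAPACITY_GB = 288
HBM_BANDWIDTH_GBS = 8_601

model_gb = 200  # hypothetical quantized model resident in HBM
assert model_gb <= HBM_CAPACITY_GB, "model must fit in on-chip HBM"

# Floor: one full weight read per generated token.
floor_ms = model_gb / HBM_BANDWIDTH_GBS * 1e3
print(f"~{floor_ms:.1f} ms/token floor, ~{1e3 / floor_ms:.0f} tokens/s ceiling")
```

Real throughput depends on batching, quantization, and KV-cache traffic, but the single-stream floor is a useful sanity check when comparing chips.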

The headline innovation is the Collectives Acceleration Engine (CAE). Google replaced four SparseCores with a single CAE that reduces on-chip collective operation latency by 5x. In multi-agent AI systems — where dozens or hundreds of agents run concurrently and need to coordinate results — this latency directly impacts user-facing response time. The CAE is unambiguously designed for the agentic AI workloads Google sees dominating inference demand.
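To see why a 5x cut in collective latency matters at the response level, consider how many collective operations a single sharded-inference response triggers. Every number below is hypothetical, just to illustrate the accumulation:

```python
# Rough impact of a 5x collective-latency cut on one response.
# All numbers are hypothetical for illustration.

collectives_per_response = 2_000   # e.g. a few per layer, across many decode steps
old_latency_us = 25.0
new_latency_us = old_latency_us / 5  # the claimed 5x reduction

saved_ms = collectives_per_response * (old_latency_us - new_latency_us) / 1e3
print(f"~{saved_ms:.0f} ms shaved off each response")
```

Tens of milliseconds per response, multiplied across coordinating agents, is the kind of saving that shows up directly in user-facing latency.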

The Boardfly topology is equally interesting. Instead of the 3D torus used in TPU 8t, Boardfly uses a hierarchical structure that reduces maximum network diameter by over 50%, cutting all-to-all communication latency in half. And that 384 MB on-chip SRAM (3x increase) means the KV cache for large MoE models can live on-chip, dramatically reducing the latency hit from long-context reasoning chains.
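Whether a KV cache actually fits in 384 MB of SRAM depends on model shape and context length. Here's the standard grouped-query-attention sizing formula applied to a hypothetical model; none of the dimensions below are published TPU 8i workloads:

```python
# Does a KV cache fit in TPU 8i's 384 MB on-chip SRAM?
# Model shape is hypothetical; the formula is the standard GQA one
# (K and V cached per layer).

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

SRAM_BYTES = 384 * 1024 * 1024

# Hypothetical model: 32 layers, 8 KV heads of dim 128, FP8 cache, 4K context
cache = kv_cache_bytes(32, 8, 128, 4096, 1)
print(f"{cache / 2**20:.0f} MB KV cache; fits in SRAM: {cache <= SRAM_BYTES}")
```

At longer contexts the cache spills back to HBM, so in practice the SRAM likely holds the hottest portion rather than the whole cache.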

How This Plays Against NVIDIA Rubin

NVIDIA announced the Rubin platform around the same time, with cloud providers planning Rubin-based instances for H2 2026. NVIDIA's claims: 10x inference token cost reduction and 4x fewer GPUs needed for training versus Blackwell.

So how does Google TPU 8th gen compare?

Honestly, a direct apples-to-apples comparison is hard. NVIDIA still takes the single-chip approach that handles both training and inference, while Google has committed to specialization. The ecosystems are also fundamentally different: NVIDIA's CUDA moat is massive, while Google TPU is strongest in the JAX/Pathways world.

What's clear is that the broader AI chip competition is intensifying — Cerebras, AMD, and a wave of well-funded startups are all chasing NVIDIA's dominance. Google's two-chip bet signals confidence that the era of "one chip for everything" is ending as AI workloads mature and specialize.

For developers, more competition means lower costs. That's a good outcome regardless of which vendor wins.

What Developers Need to Know Right Now

Software compatibility

Both TPU 8t and 8i support the main frameworks:

  • JAX (primary, recommended)
  • PyTorch (native preview support)
  • vLLM (inference serving)
  • MaxText, Keras, SGLang

CUDA code doesn't run directly on TPUs — that's the same situation as previous generations. If you're already in the JAX/XLA world, migration overhead is manageable. If you're running CUDA-native workflows, porting remains a significant investment.

Availability

Both chips are targeting H2 2026 GA. Google has an early access interest form on the Cloud TPU page. Expect hyperscaler and large enterprise customers (think Anthropic-scale users) to get access first. If you're an independent developer or small team, budget a few months after the official GA date for availability at reasonable quotas.

Pricing

Not announced yet. Given the 80% inference efficiency improvement claim on TPU 8i versus Ironwood, you'd hope to see meaningfully lower per-token pricing when it launches — but we'll see how Google actually prices it in the market.

What about existing TPU users?

If you're already running workloads on Ironwood (TPU v7), your existing JAX/PyTorch code should migrate without major surgery. Google has maintained backward compatibility across TPU generations for their core software stack. The XLA compiler handles much of the hardware-specific optimization automatically, and the Boardfly topology of TPU 8i is abstracted away by the compiler. That said, if you have hand-tuned kernel code using Pallas or Mosaic, you'll want to revisit those optimizations for the new memory hierarchy — especially around the 3x larger on-chip SRAM in TPU 8i.

A Note on Ecosystem Lock-In

One honest concern worth raising: the deeper you go into Google's TPU ecosystem, the harder it becomes to move to NVIDIA later. JAX and XLA are genuinely excellent tools, but the CUDA ecosystem's breadth — third-party libraries, community support, tutorials, hire-ability of engineers who know it — remains an advantage that Google hasn't closed.

My read: if you're already a Google Cloud user with workloads in JAX, TPU 8th gen is a compelling upgrade path. If you're starting fresh and want optionality, the CUDA ecosystem remains the safer default until TPU availability and pricing are clearer. The two-chip architecture is clever, but clever doesn't automatically translate to affordable or accessible for teams outside the hyperscaler tier.

The Bottom Line

[Image: TPU 8t vs 8i comparison summary. Source: Google Cloud Blog | TPU 8t and TPU 8i compared]

Google's decision to split TPU 8th gen into two chips reflects a maturation of the AI infrastructure market. Training and inference aren't just "different speeds" of the same operation — they have fundamentally different hardware requirements, and building one chip that does both means compromising on both.

TPU 8i, with its focus on latency and memory bandwidth, arrives at exactly the right time: the agentic AI era that Google has been positioning for. When millions of AI agents run concurrently and every millisecond of latency affects product quality, purpose-built inference silicon starts making a lot of sense.

I'll be watching the H2 2026 GA launch carefully. If early access opens up before then, testing TPU 8i's actual inference throughput on vLLM is first on my list. Anyone else planning to try it out?

