Sunday, December 14, 2025

Chips: "AWS Trainium3 Deep Dive | A Potential Challenger Approaching" (AMZN; NVDA)

From Semianalysis, December 4:

Hot on the heels of our 10K-word deep dive on TPUs, Amazon launched Trainium3 (Trn3) into general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly closing the gap and becoming competitive. Last year we detailed Amazon’s ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic’s training/inference needs.

Today, we are publishing our next technical bible on the step-function improvement that is Trainium3: the chip, microarchitecture, system and rack architecture, scale-up network, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software; on desktop, a table of contents makes it possible to jump to specific sections.

Amazon Basics GB200 aka GB200-at-Home
With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on the custom silicon side to managing their own supply chain and multi-sourcing components from multiple vendors.

On the systems and networking front, AWS is following an “Amazon Basics” approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T, or 51.2T scale-out switch, or whether to go with liquid or air cooling, are merely means to an end: delivering the best TCO for a given client and a given datacenter.
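To give a feel for what these switch bandwidth classes imply, here is a minimal sketch (our own illustration, not from the article) of the port radix each class supports; the 400G/800G port speeds and the single-ASIC assumption are ours:

```python
# Illustrative only: port radix of a scale-out switch ASIC at a given port
# speed. The 12.8T/25.6T/51.2T classes come from the article; the per-port
# line rates below are our own assumption of common Ethernet speeds.
SWITCH_CLASSES_TBPS = [12.8, 25.6, 51.2]
PORT_SPEEDS_GBPS = [400, 800]   # assumed per-port line rates

for asic_tbps in SWITCH_CLASSES_TBPS:
    options = ", ".join(
        f"{int(asic_tbps * 1000 // speed)} x {speed}G" for speed in PORT_SPEEDS_GBPS
    )
    print(f"{asic_tbps}T switch -> {options} ports")

# A 51.2T ASIC yields 128 x 400G or 64 x 800G ports; higher radix means fewer
# switch tiers (and less cost and power) for a given cluster size, which is
# the perf-per-TCO lever being weighed here.
```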

For the scale-up network, while Trn2 supports only a 4x4x4 3D torus mesh scale-up topology, Trainium3 adds a unique switched fabric that is somewhat similar to the GB200 NVL36x2 topology with a few key differences. The switched fabric was added because a switched scale-up topology delivers better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.
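To make the topology trade-off concrete, the sketch below (our own, not AWS's routing logic) compares worst-case hop counts in a 4x4x4 torus against a single-tier switched fabric; it ignores link widths and is only meant to illustrate why all-to-all MoE traffic favors a switch:

```python
# Illustrative comparison of scale-up topologies (not AWS's actual routing).
# In a 3D torus each hop moves +/-1 along one axis with wraparound; in a
# single-tier switched fabric every chip pair is at most two links apart:
# chip -> switch -> chip.

def torus_hops(a, b, dims=(4, 4, 4)):
    """Minimal hop count between chips a and b in a wraparound 3D torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

worst_pair = ((0, 0, 0), (2, 2, 2))            # a worst-case pair in a 4x4x4 torus
print("torus worst-case hops:", torus_hops(*worst_pair))   # -> 6
print("switched fabric hops : 2 (chip -> switch -> chip)")

# MoE all-to-all traffic makes every chip talk to every other chip, so the
# multi-hop torus paths burn intermediate-link bandwidth, while a switched
# fabric gives every pair a uniform, single-switch path.
```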

Even for the switches used in this scale-up architecture, AWS has decided not to decide: they will go with three different scale-up switch solutions over the lifecycle of Trainium3. They start with a 160-lane, 20-port PCIe switch for fast time to market, given today's limited availability of higher lane- and port-count PCIe switches, later move to 320-lane PCIe switches, and ultimately pivot to a larger UALink fabric for best performance.
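As a back-of-the-envelope illustration, the 160-lane, 20-port part works out to x8 ports; the per-lane rate below is our assumption (roughly PCIe Gen6-class), not a figure from the article:

```python
# Illustrative arithmetic on the first scale-up PCIe switch named above.
# The per-lane rate is our assumption (~7.5 GB/s per lane per direction,
# roughly PCIe Gen6-class after encoding); AWS's actual rate may differ.
LANES, PORTS = 160, 20
GBPS_PER_LANE = 7.5   # assumed GB/s per lane per direction

width = LANES // PORTS                # -> x8 ports
per_port = width * GBPS_PER_LANE      # -> ~60 GB/s per port per direction
print(f"x{width} ports, ~{per_port:.0f} GB/s per port per direction")

# The later 320-lane parts double the aggregate lane count (the article does
# not state their port count), and the eventual UALink fabric trades this
# PCIe-based approach for higher scale-up bandwidth.
```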

Amazon’s Software North Star
On the software front, AWS’s North Star expands and opens their software stack to target the masses, moving beyond just optimizing perf per TCO for internal Bedrock workloads (i.e., DeepSeek/Qwen/etc., which run on a private fork of vLLM v1) and for Anthropic’s training and inference workloads (which run on a custom inference engine with all-custom NKI kernels).

In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing the compiler for their kernel language, “NKI” (Neuron Kernel Interface), and their kernel and communication libraries for matmul and ML ops (analogous to NCCL, cuBLAS, cuDNN, and ATen ops). Phase 2 consists of open sourcing their XLA graph compiler and JAX software stack.
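For readers less familiar with why a native PyTorch backend matters, the sketch below contrasts today's PyTorch/XLA lazy-tensor path with the eager-mode pattern a native backend enables; the device string "neuron" is our placeholder, not a confirmed API name:

```python
import torch

# Existing Trainium path: PyTorch/XLA lazy tensors, where ops are traced into
# an XLA graph and compiled before execution:
#   import torch_xla.core.xla_model as xm
#   device = xm.xla_device()
#
# The Phase 1 direction described above is a native (eager) PyTorch backend,
# so ordinary PyTorch code dispatches op-by-op to Neuron kernels with no
# graph-tracing detour. The device string "neuron" is our placeholder, not a
# confirmed API name, so this sketch falls back to CPU to stay runnable.
device = torch.device("cpu")          # placeholder for torch.device("neuron")

x = torch.randn(1024, 1024, device=device)
y = torch.nn.functional.gelu(x @ x)   # each eager op dispatches to backend kernels
print(y.shape)
```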

By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem. We believe the CUDA Moat isn’t constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem. AWS has internalized this and is pursuing the exact same strategy.

Trainium3 will only have Day 0 support for Logical NeuronCore (LNC) = 1 or LNC = 2. LNC = 1 or LNC = 2 is what ultra-advanced, elite L337 kernel engineers at Amazon/Anthropic want, but LNC = 8 is what the wider ML research scientist community prefers before widely adopting Trainium. Unfortunately, AWS does not plan on supporting LNC = 8 until mid-2026. Further down, we will expand on what LNC is and why the different modes are critical for research scientist adoption.
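As a rough illustration of what LNC controls (our own sketch; the core and chip counts are assumed example values, not confirmed Trainium3 specs), the setting determines how many physical NeuronCores are fused into each logical device the framework sees:

```python
# Illustrative sketch of Logical NeuronCore (LNC) grouping, not AWS's API.
# LNC sets how many physical NeuronCores are fused into one logical device
# presented to the framework. Core/chip counts are assumed example values,
# not confirmed Trainium3 specifications.
PHYSICAL_CORES_PER_CHIP = 8           # assumed example value
CHIPS_PER_NODE = 16                   # assumed example value

def visible_devices(lnc: int) -> int:
    """Logical devices a framework would see on one node at a given LNC."""
    assert PHYSICAL_CORES_PER_CHIP % lnc == 0
    return CHIPS_PER_NODE * PHYSICAL_CORES_PER_CHIP // lnc

for lnc in (1, 2, 8):
    print(f"LNC={lnc}: {visible_devices(lnc)} logical devices per node")

# Lower LNC exposes many small devices that elite kernel engineers can
# schedule explicitly; LNC=8 hides that complexity behind fewer, larger
# devices, which is what most research scientists expect to program against.
```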

Trainium3’s go-to-market opens yet another front Jensen must now contend with, in addition to the other two battle theatres: Google’s TPUv7 with its extremely strong perf per TCO, and a resurgent AMD’s MI450X UALoE72 with potentially strong perf per TCO (especially after the “equity rebate” under which OpenAI gets to own up to 10% of AMD shares).

We still believe Nvidia will stay King of the Jungle as long as they keep accelerating their pace of development and move at the speed of light. Jensen needs to ACCELERATE even faster than he has over the past 4 months. In the same way that Intel stayed complacent in CPUs while others like AMD and ARM raced ahead, if Nvidia stays complacent it will lose its pole position even more rapidly.

Today, we will discuss the two Trainium3 rack SKUs that support switched scale-up racks:

  • Air Cooled Trainium3 NL32x2 Switched (Codename “Teton3 PDS”)

  • Liquid Cooled Trainium3 NL72x2 Switched (Codename “Teton3 MAX”)

We will start by briefly reviewing the Trn2 architecture and explaining the changes introduced with Trainium3. The first half of the article will focus on the various Trainium3 rack SKUs’ specifications, silicon design, rack architecture, bill of materials (BoM) and power budget before we turn to the scale-up and scale-out network architecture. In the second half of this article, we will focus on discussing the Trainium3 Microarchitecture and expand further on Amazon’s software strategy. We will conclude with a discussion on Amazon and Anthropic’s AI Datacenters before tying everything together with a Total Cost of Ownership (TCO) and Perf per TCO analysis....
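For orientation before that analysis, here is a minimal sketch of how a perf-per-TCO figure is typically assembled (our own simplified model with placeholder numbers, not SemiAnalysis's methodology):

```python
# Simplified perf-per-TCO model, illustrative only. All inputs are made-up
# placeholders; the article's real analysis includes many more cost lines
# (networking, datacenter buildout, failure rates, utilization, etc.).

def annual_tco(capex, lifetime_years, power_kw, pue, dollars_per_kwh, opex_per_year):
    """Capex amortization + power + other opex, per accelerator per year."""
    capex_per_year = capex / lifetime_years
    power_per_year = power_kw * pue * 24 * 365 * dollars_per_kwh
    return capex_per_year + power_per_year + opex_per_year

def perf_per_tco(delivered_tflops, utilization, tco_per_year):
    """Delivered compute per dollar of yearly ownership cost."""
    return delivered_tflops * utilization / tco_per_year

tco = annual_tco(capex=15_000, lifetime_years=4, power_kw=0.7,
                 pue=1.2, dollars_per_kwh=0.08, opex_per_year=1_000)
print(f"TCO per accelerator-year: ${tco:,.0f}")
print(f"perf per TCO: {perf_per_tco(1_000, 0.4, tco):.3f} TFLOPS per $/yr")
```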

....MUCH MORE