From The Register, July 4:
It's almost like AWS is building its own Stargate
deep dive Amazon Web Services (AWS) is in the process of building out a massive supercomputing cluster containing "hundreds of thousands" of accelerators that promises to give its model building buddies at Anthropic a leg up in the AI arms race.
The system, dubbed Project Rainier, is set to come online later this year with compute spanning multiple sites across the US. Gadi Hutt, the director of product and customer engineering at Amazon's Annapurna Labs, tells El Reg that one site in Indiana will span thirty datacenters at 200,000 square feet apiece. This facility alone was recently reported to consume upwards of 2.2 gigawatts of power.
But unlike OpenAI's Stargate, xAI's Colossus, or AWS's own Project Ceiba, this system isn't using GPUs. Instead, Project Rainier will represent the largest deployment of Amazon's Annapurna AI silicon ever.
"This is the first time we are building such a large-scale training cluster that will allow a customer, in this case Anthropic, to train a single model across all of that infrastructure," Hutt said. "The scale is really unprecedented."
Amazon, in case you've forgotten, is among Anthropic's biggest backers, having already invested $8 billion in the OpenAI rival.
Amazon isn't ready to disclose the full scope of the project, and since it's a multi-site project akin to Stargate, as opposed to a singular AI factory like Colossus, Project Rainier may not have a fixed upper bound. And all plans assume the economic conditions that gave rise to the AI boom don't fizzle out.
However, we're told that Anthropic has already managed to get its hands on a sliver of the system's compute.
While we don't know just how many Trainium chips or datacenters will ultimately power Project Rainier and probably won't until re:Invent in November, we do have a pretty good idea of what it'll look like. So, here's everything we know about Project Rainier so far.
The basic unit of compute
The heart of Project Rainier is Annapurna Labs' Trainium2 accelerator, which Amazon let loose back in December.
Despite what its name might suggest, the chip can be used for both training and inference workloads, which will come in handy for customers using reinforcement learning (RL), as we saw with DeepSeek R1 and OpenAI's o1, to imbue their models with reasoning capabilities.
"RL as a workload has a lot of inference built into it because we need to verify the results during the steps of training," Hutt said.
The chip itself features a pair of 5nm compute dies glued together using TSMC's chip-on-wafer-on-substrate (CoWoS) packaging tech and fed by four HBM stacks. Combined, each Trainium2 accelerator offers 1.3 petaFLOPS of dense FP8 performance, 96GB of HBM, and 2.9TB/s of memory bandwidth.
On its own, the chip doesn't look all that competitive. Nvidia's B200, for instance, boasts 4.5 petaFLOPS of dense FP8, 192GB of HBM3e, and 8TB/s of memory bandwidth.
Support for 4x sparsity, which can dramatically speed up AI training workloads, does help Trainium2 close the gap some, boosting FP8 perf to 5.2 petaFLOPS, but it still falls behind the B200, which hits 9 petaFLOPS of sparse compute at the same precision.
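Plugging the figures quoted above into a quick back-of-the-envelope comparison shows the per-chip gap. This is spec-sheet arithmetic only, not measured benchmarks:

```python
# Per-chip comparison using the spec figures quoted in this article.
# Paper numbers only, not measured benchmarks.
chips = {
    "Trainium2": {"dense_fp8_pflops": 1.3, "sparse_fp8_pflops": 5.2,
                  "hbm_gb": 96, "mem_bw_tbps": 2.9},
    "B200":      {"dense_fp8_pflops": 4.5, "sparse_fp8_pflops": 9.0,
                  "hbm_gb": 192, "mem_bw_tbps": 8.0},
}

t2, b200 = chips["Trainium2"], chips["B200"]
for key in t2:
    ratio = t2[key] / b200[key]
    print(f"{key:>18}: Trainium2 offers {ratio:.0%} of a B200")
```

On those numbers, a single Trainium2 lands at roughly 29 percent of a B200 on dense FP8 and about 58 percent on sparse FP8, which is why the chip-for-chip view flatters Nvidia.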
Trn2
While Trainium2 may look a little anemic in a chip-for-chip comparison with Nvidia's latest accelerators, that doesn't tell the full story. Unlike the H100 and H200-series GPUs, Nvidia's B200 only comes in an eight-way HGX form factor. Similarly, AWS' minimum configuration for Trainium2, which it refers to as its Trn2 instances, has 16 accelerators.
"When you're talking about large clusters, it's less important what a single chip provides you, it's more of what is called 'good put,'" Hutt explains. "What's your good throughput of training which also takes into account downtime? … I don't see a lot of talk about this in the industry, but this is the metric that customers are looking at."
Compared against Nvidia's HGX B200 systems, the performance gap is far narrower. The Blackwell-based parts still have an advantage when it comes to memory bandwidth and dense FP8 compute, which are key indicators of inference performance.
For training workloads, Amazon's Trn2 instances do have a bit of an advantage as they — at least on paper — offer higher sparse floating-point performance at FP8. Yes, Nvidia's Blackwell chips do support 4-bit floating point precision, but we've yet to see anyone train a model at that precision. Sparse compute is most useful when large volumes of data are expected to have values of zero. As a result, sparsity isn't usually that helpful for inference, but can make a big difference in training.
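Scaling the same paper figures up to the minimum configurations on each side — 16 Trainium2 chips per Trn2 instance versus eight B200s in an HGX box — shows where that sparse-FP8 edge comes from. Again, this is spec-sheet arithmetic, not a benchmark, and real training throughput will hinge on interconnect, software, and the goodput factors above:

```python
# Instance-level spec-sheet arithmetic: 16x Trainium2 (Trn2) vs 8x B200 (HGX).
# Paper figures only; real-world results depend on interconnect, software
# stack, and the goodput factors discussed above.
trn2 = {"chips": 16, "dense_fp8": 1.3, "sparse_fp8": 5.2, "hbm_gb": 96, "bw_tbps": 2.9}
hgx  = {"chips": 8,  "dense_fp8": 4.5, "sparse_fp8": 9.0, "hbm_gb": 192, "bw_tbps": 8.0}

for name, spec in (("Trn2 (16x Trainium2)", trn2), ("HGX B200 (8x B200)", hgx)):
    n = spec["chips"]
    print(f"{name}: {n * spec['dense_fp8']:.1f} PFLOPS dense FP8, "
          f"{n * spec['sparse_fp8']:.1f} PFLOPS sparse FP8, "
          f"{n * spec['hbm_gb']} GB HBM, {n * spec['bw_tbps']:.1f} TB/s")
```

On those numbers, the HGX B200 box keeps its lead in dense FP8 (36 vs 20.8 petaFLOPS) and memory bandwidth (64 vs 46.4 TB/s), while the Trn2 instance edges ahead on sparse FP8 (83.2 vs 72 petaFLOPS) and matches it on total HBM capacity.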
With that out of the way, here's a quick look at how Nvidia's Blackwell B200 stacks up against AWS' Trn2 instances....
....MUCH MORE. When El Reg says deep dive, they go deep. And you know Nvidia is paying attention to the interloper.