Sunday, October 6, 2024

Chips and Data Centers: "AI Neocloud Playbook and Anatomy"

From SemiAnalysis, October:

H100 Rental Price Cuts, AI Neocloud Giants and Emerging Neoclouds, H100 Cluster Bill of Materials and Cluster Deployment, Day to Day Operations, Cost Optimizations, Cost of Ownership and Returns

The rise of the AI Neoclouds has captured the attention of the entire computing industry. Everyone from enterprises to startups is using them for access to GPU compute. Even Microsoft is spending ~$200 million a month on GPU compute through AI Neoclouds despite having its own datacenter construction and operation teams. Nvidia has heralded the rapid growth of several AI Neoclouds through direct investments, large allocations of its GPUs, and accolades in various speeches and events.

An AI Neocloud is defined as a new breed of cloud compute provider focused on offering GPU compute rental. These pure-play GPU clouds offer cutting-edge performance and flexibility to their customers, but the economics powering them are still evolving, just as the market is still learning how their business models work.

In the first half of this deep dive, we will peel back the layers of running a Neocloud, from crafting a cluster Bill of Materials (BoM), to navigating the complexities of deployment, funding, and day-to-day operations. We will provide several key recommendations in terms of BoM and cluster architecture.

In the second half of the report, we explain the AI Neocloud economy and discuss in detail these Neoclouds’ go-to-market strategies, total cost of ownership (TCO), margins, business case, and potential return on investment for a variety of situations.

Lastly, we will address the rapid shifts in H100 GPU rental pricing across a number of hyperscalers and Neoclouds, discussing the meaningful declines in on-demand pricing in just the past month, shifts in the term structure of H100 GPU contract pricing, and how the market will evolve with the upcoming deployments of Blackwell GPUs.

Further granularity and higher frequency data on GPU pricing across many SKUs is available in our AI GPU Rental Price Tracker....

....The Giants and the Emerging

The AI Neocloud market is served by four main categories of providers: Traditional Hyperscalers, Neocloud Giants, Emerging Neoclouds, and Brokers/Platforms/Aggregators.

The AI Neocloud market is huge and is the most meaningful incremental driver of GPU demand. In broad strokes, we see the Neoclouds growing to more than a third of total demand.

Traditional hyperscalers offering AI cloud services include Google Cloud (GCP), Microsoft Azure, Amazon Web Services (AWS), Oracle, Tencent, Baidu, and Alibaba. In contrast, Meta, xAI, ByteDance, and Tesla, despite also having formidable GPU fleets and considerable capacity expansion plans, do not currently offer AI cloud services and thus do not fall into this group.

Traditional hyperscalers’ diversified business models give them the lowest cost of capital, but their integrated ecosystems, data lakes, and existing enterprise customer bases mean very premium pricing compared to others. Hyperscalers also tend to earn high margins on their cloud businesses, so pricing is set much higher than is reasonable for AI cloud purposes.

AI Neocloud Giants, unlike traditional hyperscalers, focus almost exclusively on GPU cloud services. The largest have current or planned capacity over the next few years well in excess of 100k H100 equivalents in aggregate across all their sites, with some planning for hundreds of thousands of Blackwell GPUs for OpenAI. The three main Neocloud Giants are Crusoe, Lambda Labs, and CoreWeave, which is by far the largest. They have a higher cost of capital compared to the hyperscalers but usually enjoy better access to capital at reasonable rates than Emerging AI Neoclouds, which means a lower comparative cost of ownership for the Neocloud Giants.....

....MUCH MORE
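On the cost-of-capital point in the excerpt above: financing cost flows straight into a provider's all-in cost per GPU-hour, which is why cheaper capital means a lower cost of ownership. A minimal back-of-the-envelope sketch in Python; every number below (capex, opex, rates, depreciation period, utilization) is an illustrative assumption of ours, not a figure from the SemiAnalysis report:

    # Rough all-in cost per rented GPU-hour. All inputs are illustrative
    # assumptions, not figures from the SemiAnalysis report.
    HOURS_PER_YEAR = 365 * 24

    def gpu_hour_cost(capex_per_gpu, cost_of_capital, depreciation_years,
                      annual_opex_per_gpu, utilization):
        financing = capex_per_gpu * cost_of_capital        # interest on the hardware
        depreciation = capex_per_gpu / depreciation_years  # straight-line
        annual_total = financing + depreciation + annual_opex_per_gpu
        return annual_total / (HOURS_PER_YEAR * utilization)

    # Identical hardware, different borrowing costs:
    for name, rate in [("Neocloud Giant", 0.09), ("Emerging Neocloud", 0.15)]:
        cost = gpu_hour_cost(capex_per_gpu=35_000, cost_of_capital=rate,
                             depreciation_years=5, annual_opex_per_gpu=3_000,
                             utilization=0.80)
        print(f"{name}: ~${cost:.2f}/GPU-hr")
    # -> roughly $1.88 vs $2.18 per GPU-hour: on these assumptions, cheaper
    #    capital alone moves the breakeven rental price by ~$0.30/hr.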

From EE Times, June 26:

The latest round of MLPerf training benchmarks includes GPT-3, the model ChatGPT is based on, for the first time. The GPT-3 training crown was claimed by cloud provider CoreWeave using more than 3,000 Nvidia H100 GPUs. What’s more surprising is that there were no entries from previous training submitters such as Google and Graphcore, or from other competitors like AMD. It was left to Intel’s Habana Labs to be the only challenger to Nvidia on GPT-3, with its Gaudi2 accelerator.

CoreWeave used 3,584 Nvidia HGX H100 GPUs to train a representative portion of GPT-3 in 10.94 minutes (this is the largest number of GPUs the cloud provider could make available at one time, not the full size of its cluster). A portion of GPT-3 is used for the benchmark because it would be impractical to insist submitters train the entirety of GPT-3, which could take months and cost millions of dollars. Submitters instead train an already partially trained GPT-3 from a particular checkpoint until it converges to a certain accuracy. The portion used is about 0.4% of the total training workload for GPT-3; based on CoreWeave’s 10.94-minute score, 3,584 GPUs would take almost two days to train the whole thing.....
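As a quick sanity check on that "almost two days" figure, scale the 10.94-minute result linearly, assuming the ~0.4% benchmark slice is representative of the full workload:

    # Linear extrapolation from the MLPerf slice to a full GPT-3 run.
    benchmark_minutes = 10.94
    benchmark_fraction = 0.004          # the slice is ~0.4% of the full workload

    full_run_minutes = benchmark_minutes / benchmark_fraction
    print(f"{full_run_minutes:.0f} min = {full_run_minutes/60:.1f} h "
          f"= {full_run_minutes/(60*24):.1f} days")
    # -> 2735 min = 45.6 h = 1.9 days on the same 3,584 GPUs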

December 2023
Cloud: GPUs as a Service Gets Big Backers (GaaS)
March 2024
"Nvidia CEO Becomes Kingmaker by Name-Dropping Stocks" (NVDA+++++++)
March 2024
Google Cloud Is Losing Top Executives
March 26, 2024
AI In The Cloud: "CoreWeave Is in Talks for Funding at $16 Billion Valuation"

And on Lambda – formerly known as Lambda Labs:
February 2024
For Those Who Can't Afford An Nvidia H-100 Chip: "Lambda Snags $320 Million To Grow Its Rent-A-GPU Cloud"
A single chip will set you back over $40,000. For everyone else, there's Cloud GPU as a Service.