Thursday, June 29, 2023

Chips: "Nvidia Leads, Habana Challenges on MLPerf GPT-3 Benchmark" (NVDA; INTC)

From EE Times, June 26:

The latest round of MLPerf training benchmarks includes GPT-3, the model ChatGPT is based on, for the first time. The GPT-3 training crown was claimed by cloud provider CoreWeave using more than 3,000 Nvidia H100 GPUs. What's more surprising is that there were no entries from previous training submitters such as Google and Graphcore, nor from other competitors like AMD. That left Intel's Habana Labs as the only challenger to Nvidia on GPT-3, with its Gaudi2 accelerator.

CoreWeave used 3,584 Nvidia HGX H100s to train a representative portion of GPT-3 in 10.94 minutes (this is the largest number of GPUs the cloud provider could make available at one time; it is not the full size of its cluster). Only a portion of GPT-3 is used for the benchmark, since it would be impractical to insist submitters train the entirety of GPT-3, which could take months and cost millions of dollars. Submitters instead train an already partially-trained GPT-3 from a particular checkpoint until it converges to a certain accuracy. The portion used is about 0.4% of the total training workload for GPT-3; based on CoreWeave's 10.94-minute score, 3,584 GPUs would take almost two days to train the whole thing.
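The "almost two days" extrapolation follows directly from the article's two figures (a 10.94-minute benchmark run covering roughly 0.4% of the full training workload), as a quick back-of-envelope check shows:

```python
# Back-of-envelope check of the article's extrapolation. The 10.94-minute
# score and the ~0.4% portion size come from the text above; everything
# else is simple scaling on the same 3,584-GPU cluster.
portion_minutes = 10.94    # CoreWeave's GPT-3 benchmark time
portion_fraction = 0.004   # benchmark covers about 0.4% of full training

full_minutes = portion_minutes / portion_fraction
full_days = full_minutes / (60 * 24)
print(f"{full_minutes:.0f} minutes = {full_days:.1f} days")  # 2735 minutes = 1.9 days
```

This assumes throughput stays constant across the whole run, which is the usual simplification for this kind of estimate.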

[Figure: Nvidia's graph shows per-accelerator performance for its H100 versus Intel Xeon CPUs and Habana Labs Gaudi2, normalized to the H100 result (taller is better). Source: Nvidia]

Nvidia H100s were used for the bulk of the GPT-3 submissions. This is the leading hardware for AI training on the market. Its software includes Nvidia’s Transformer Engine, designed specifically to speed up training and inference of networks based on the same architecture as GPT-3, by lowering precision to FP8 to improve throughput wherever possible....
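The precision-lowering idea behind Transformer Engine can be sketched in plain Python. The sketch below assumes the FP8 e4m3 format (3 mantissa bits, largest finite value 448) that H100 hardware supports; it is a toy per-tensor quantizer to illustrate why FP8 trades precision for throughput, not Nvidia's actual implementation.

```python
import math

E4M3_MAX = 448.0    # largest finite value representable in FP8 e4m3
MANTISSA_BITS = 3   # e4m3 keeps 3 explicit mantissa bits

def quantize_e4m3(values):
    """Toy per-tensor FP8 quantization sketch (not Transformer Engine's
    kernel): scale the tensor so its largest magnitude maps to the e4m3
    maximum, then round each value to 3 mantissa bits."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    out = []
    for v in values:
        s = v * scale
        if s == 0.0:
            out.append(0.0)
            continue
        # Spacing between representable values at this value's exponent.
        exp = math.floor(math.log2(abs(s)))
        step = 2.0 ** (exp - MANTISSA_BITS)
        out.append(round(s / step) * step / scale)
    return out
```

With only 3 mantissa bits, each value is reproduced to within about one part in eight, which is why FP8 works for throughput-bound matrix multiplies during training but still needs higher-precision accumulation and per-tensor scaling to keep the loss converging.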