Nvidia is riding high at the moment. The company has managed to increase the performance of its chips on AI tasks a thousandfold over the past 10 years, it’s raking in money, and it’s reportedly very hard to get your hands on its newest AI-accelerating GPU, the H100.
How did Nvidia get here? The company’s chief scientist, Bill Dally, managed to sum it all up in a single slide during his keynote address last week at the IEEE’s Hot Chips 2023 symposium on high-performance microprocessors in Silicon Valley. Moore’s Law was a surprisingly small part of Nvidia’s magic and new number formats a very large part. Put it all together and you get what Dally called Huang’s Law (for Nvidia CEO Jensen Huang).

Nvidia chief scientist Bill Dally summed up how Nvidia has boosted the performance of its GPUs on AI tasks a thousandfold over 10 years. Image: Nvidia
Number Representation: 16x
“By and large, the biggest gain we got was from better number representation,” Dally told engineers. These numbers represent the key parameters of a neural network. One kind of parameter is the weights (the strength of neuron-to-neuron connections in a model) and another is the activations (the values computed from the sum of the weighted inputs at a neuron, which determine whether it activates and propagates information to the next layer). Before the P100, Nvidia GPUs represented those weights using single-precision floating-point numbers. Defined by the IEEE 754 standard, these are 32 bits long, with 23 bits representing a fraction, 8 bits acting essentially as an exponent applied to the fraction, and one bit for the number’s sign.
But machine-learning researchers were quickly learning that in many calculations, they could use less precise numbers and their neural network would still come up with answers that were just as accurate. The clear advantage of doing this is that the logic that does machine learning’s key computation, multiply and accumulate, can be made faster, smaller, and more efficient if it needs to process fewer bits. (The energy needed for multiplication is proportional to the square of the number of bits, Dally explained.) So, with the P100, Nvidia cut that number in half, using FP16. Google even came up with its own version called bfloat16. (The difference is in the relative number of fraction bits, which give you precision, and exponent bits, which give you range. Bfloat16 has the same number of range bits as FP32, so it’s easier to switch back and forth between the two formats.)
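To make the trade-offs above concrete, here is a minimal Python sketch. It is an illustration, not anything shown in Dally's talk: the function names are hypothetical, the bfloat16 conversion simply truncates the low 16 bits of an FP32 value (real hardware typically rounds), and the energy figure just applies the quadratic rule of thumb quoted above.

```python
import struct

# (sign, exponent, fraction) bit widths for each format
FORMATS = {
    "FP32": (1, 8, 23),
    "FP16": (1, 5, 10),
    "bfloat16": (1, 8, 7),
}

def total_bits(name):
    """Total width of a format in bits."""
    return sum(FORMATS[name])

def relative_multiply_energy(bits, baseline_bits=32):
    """Multiplier energy relative to a baseline, assuming energy ~ bits**2."""
    return (bits / baseline_bits) ** 2

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (IEEE 754 half precision)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Approximate bfloat16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: plain truncation, chosen for simplicity.
    """
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))[0]

if __name__ == "__main__":
    for name in FORMATS:
        bits = total_bits(name)
        print(name, bits, "bits, relative multiply energy:",
              relative_multiply_energy(bits))
    # Precision: FP16 has more fraction bits, so it lands closer to 0.1.
    print("0.1 as FP16:    ", to_fp16(0.1))
    print("0.1 as bfloat16:", to_bfloat16(0.1))
    # Range: 100000 exceeds FP16's largest finite value (65504),
    # but bfloat16 shares FP32's 8-bit exponent and can hold it coarsely.
    print("100000 as bfloat16:", to_bfloat16(100000.0))
```

Under the quadratic rule, a 16-bit multiply costs about a quarter of the energy of a 32-bit one. Because bfloat16 is the top half of an FP32 bit pattern, converting between the two is a matter of dropping or zero-padding fraction bits, which is why switching back and forth is easy.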
Fast forward to today, and Nvidia’s leading GPU, the H100, can do certain parts of massive-transformer neural networks, like ChatGPT and other large language models, using 8-bit numbers. Nvidia did find, however, that it’s not a one-size-fits-all solution. Nvidia’s Hopper GPU architecture, for example, actually computes using two different FP8 formats, one with slightly more accuracy, the other with slightly more range. Nvidia’s special sauce is in knowing when to use which format....
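The accuracy-versus-range trade-off between the two FP8 formats can be sketched in a few lines of Python. The excerpt does not name the formats; the sketch assumes they are the commonly documented E4M3 (4 exponent bits, 3 fraction bits) and E5M2 (5 exponent bits, 2 fraction bits) layouts, and the function name is hypothetical. It says nothing about how Nvidia decides which format to use where.

```python
def fp8_summary(exp_bits, man_bits, ieee_like):
    """Return the relative step size and largest finite value of a small float format.

    ieee_like=True:  the top exponent code is reserved for inf/NaN (assumed for E5M2).
    ieee_like=False: only the all-ones bit pattern is NaN, so the top exponent
                     code still holds normal numbers (assumed for E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2.0 ** -man_bits        # all fraction bits set
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 * 2.0 ** -man_bits    # all-ones fraction is NaN, so one step lower
    return {
        "epsilon": 2.0 ** -man_bits,           # gap between 1.0 and the next value
        "max": max_frac * 2.0 ** max_exp,
    }

if __name__ == "__main__":
    print("E4M3:", fp8_summary(4, 3, ieee_like=False))
    print("E5M2:", fp8_summary(5, 2, ieee_like=True))
```

With these assumptions, E4M3 has the finer step size (more accuracy) and E5M2 reaches much larger magnitudes (more range), which mirrors the split described above.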
....MUCH MORE
We last checked in with the good professor in "Hot Chips 2023 At Stanford: Nvidia Chief Scientist Bill Dally's Keynote Speech (NVDA)".
Also at IEEE Spectrum: "Nvidia’s Next GPU Shows That Transformers Are Transforming AI"