Thursday, June 6, 2024

"Nvidia Unfolds GPU, Interconnect Roadmaps Out To 2027" (Power, Power, Power) NVDA

As noted in passing in June 4's "All the datacenter roadmap updates Intel, AMD, Nvidia teased at Computex"...

From The Next Platform, June 2:

There are many things that are unique about Nvidia at this point in the history of computing, networking, and graphics. But one of them is that it has so much money on hand right now, and such a lead in the generative AI market thanks to its architecture, engineering, and supply chain, that it can indulge in just about any roadmap whim it thinks might yield progress.

Nvidia was already such a wildly successful innovator by the 2000s that it really did not have to expand into datacenter compute. But HPC researchers pulled Nvidia into accelerated computing, and then AI researchers took advantage of GPU compute and created a whole new market that had been waiting four decades for a massive amount of compute at a reasonable price to collide with huge amounts of data to truly bring to life what feels more and more like thinking machines.

Tip of the hat to Danny Hillis, Marvin Minsky, and Sheryl Handler, who tried to build such machines in the 1980s when they founded Thinking Machines to drive AI processing, not traditional HPC simulation and modeling applications, and to Yann LeCun, who was creating convolutional neural networks around the same time at AT&T Bell Labs. They had neither the data nor the compute capacity to make AI as we now know it work. At the time, Jensen Huang was a director at LSI Logic, which made storage chips, and had earlier been a CPU designer at AMD. And just as Thinking Machines was having a tough time in the early 1990s (and eventually went bankrupt), Huang had a meeting at the Denny’s on the east side of San Jose with Chris Malachowsky and Curtis Priem, and they founded Nvidia. And it is Nvidia that saw the emerging AI opportunity coming out of the research and hyperscaler communities and started building the systems software and underlying massively parallel hardware that would fulfill the AI revolution dreams that have always been part of computing since Day One.

This was always the end state of computing, and this was always the singularity – or maybe bipolarity – that we have been moving towards. If there is life on other planets, then life always evolves to a point where that world has weapons of mass destruction and always creates artificial intelligence. And probably at about the same time, too. It is what that world does with either technology after that moment that determines whether it survives a mass extinction event or not.

This may not sound like a normal introduction to a discussion of a chip maker’s roadmap. It isn’t, and that is because we live in interesting times.

During his keynote at the annual Computex trade show in Taipei, Taiwan, Nvidia’s co-founder and chief executive officer once again tried to put the generative AI revolution – which he calls the second industrial revolution – into its context and give a glimpse into the future of AI in general and for Nvidia’s hardware in particular. We got a GPU and interconnect roadmap peek – which as far as we know was not part of the plan until the last minute, as is often the case with Huang and his keynotes.

Revolution Is Inevitable
Generative AI is all about scale, and Huang reminded us of this and pointed out that the ChatGPT moment at the end of 2022 could only have happened when it did for technical as well as economic reasons.

Getting to the ChatGPT breakthrough moment required a lot of growth in GPU performance, and then a lot of GPUs on top of that. Nvidia has certainly delivered on the performance, which is important for both AI training and inference, and importantly, it has radically reduced the amount of energy it takes to generate tokens as part of large language model responses. Take a look:

http://www.nextplatform.com/wp-content/uploads/2024/06/nvidia-computex-gpu-performance-energy-over-time.jpg

The performance of a GPU has risen by 1,053X over the eight years between the “Pascal” P100 GPU generation and the “Blackwell” B100 GPU generation that will start shipping later this year and ramp on into 2025. (We know that the chart says 1,000X, but that is not precise.)

Some of that performance has come through the lowering of floating point precision – by a factor of 4X, in the shift from the FP16 formats used in the Pascal P100, Volta V100, and Ampere A100 GPUs to the FP4 formats used in the Blackwell B100s. Without that reduction in precision, which can be done without substantially hurting LLM performance – thanks to a lot of mathematical magic in data formats, software processing, and the hardware that does it – the performance increase would have been only 263X. Mind you, even 263X would be pretty good measured against eight years in the CPU market, where a 10 percent to 15 percent increase in core performance per clock and maybe a 25 percent to 30 percent increase in the number of cores per generation is normal. Those gains compound to somewhere between a 4X and 5X increase in CPU throughput over the same eight years if the upgrade cycle is two years.
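As a quick sanity check on those figures, here is a back-of-the-envelope sketch (ours, not The Next Platform's) that simply plugs in the growth assumptions stated above:

```python
# Back-of-the-envelope check of the figures quoted above (illustrative, not from the article).

gpu_gain_total = 1053        # Pascal P100 -> Blackwell B100 gain cited in the keynote chart
precision_factor = 16 / 4    # dropping from FP16 to FP4 contributes a 4X factor

# GPU gain with the precision drop factored out
iso_precision_gain = gpu_gain_total / precision_factor
print(f"GPU gain excluding the precision drop: ~{iso_precision_gain:.0f}X")   # ~263X

# Typical CPU cadence per two-year generation: 10%-15% more per-core performance,
# 25%-30% more cores. Compound that over four generations (eight years).
generations = 8 // 2
cpu_low = (1.10 * 1.25) ** generations
cpu_high = (1.15 * 1.30) ** generations
print(f"CPU throughput gain over eight years: ~{cpu_low:.1f}X to ~{cpu_high:.1f}X")
```

The compounded CPU figure lands roughly in the 4X-to-5X ballpark cited above, two orders of magnitude short of even the iso-precision GPU gain, which is the point of the comparison.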

The power reduction per unit of work shown above is a key metric, because if you can't power the system, you can't use it. The energy cost of a token has to come down, which means the energy per token generated by LLMs has to fall faster than raw performance rises....
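To make the energy constraint concrete, here is a minimal sketch with made-up numbers (ours, not figures from the keynote) of why joules per token becomes the binding limit at a fixed power budget:

```python
# Minimal sketch: at a fixed facility power budget, token throughput is limited
# by the energy it takes to generate each token. The numbers below are made up
# purely for illustration.

def max_tokens_per_second(power_budget_watts: float, joules_per_token: float) -> float:
    """Sustainable token rate when every watt goes into token generation."""
    return power_budget_watts / joules_per_token

power_budget = 10e6  # a hypothetical 10 MW datacenter power budget

# Same budget, two hypothetical efficiency levels
print(max_tokens_per_second(power_budget, 10.0))  # 1.0e6 tokens/sec at 10 J/token
print(max_tokens_per_second(power_budget, 0.5))   # 2.0e7 tokens/sec at 0.5 J/token
```

With power held fixed, the division says it all: throughput can only grow as fast as energy per token shrinks.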

....MUCH MORE