Sunday, July 23, 2023

NVIDIA's DGX GH200 Supercomputer and Grace Hopper Superchips in Production (NVDA)

From Tom's Hardware, May 28:

Nvidia’s new supercomputing silicon fuels the fires of AI.

Nvidia CEO Jensen Huang announced here at Computex 2023 in Taipei, Taiwan that the company's Grace Hopper superchips are now in full production, and the Grace platform has now earned six supercomputer wins. These chips are a fundamental building block of one of Huang's other big Computex 2023 announcements: The company's new DGX GH200 AI supercomputing platform, built for massive generative AI workloads, is now available with 256 Grace Hopper Superchips paired together to form a supercomputing powerhouse with 144TB of shared memory for the most demanding generative AI training tasks. Nvidia already has customers like Google, Meta, and Microsoft ready to receive the leading-edge systems.

Nvidia also announced its new MGX reference architectures, which will help OEMs build new AI supercomputers faster, with 100+ system configurations available. Finally, the company also announced its new Spectrum-X Ethernet networking platform, designed and optimized specifically for AI server and supercomputing clusters. Let's dive in.

Nvidia Grace Hopper Superchips Now in Production

We've covered the Grace and Grace Hopper Superchips in depth in the past. These chips are central to the new systems Nvidia announced today. The Grace chip is Nvidia's own Arm CPU-only processor, while the Grace Hopper Superchip combines the 72-core Grace CPU, a Hopper GPU, 96GB of HBM3, and 512GB of LPDDR5X on the same package, all weighing in at 200 billion transistors. This combination provides astounding data bandwidth between the CPU and GPU, up to 1 TB/s of throughput, offering a tremendous advantage for certain memory-bound workloads.
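As a back-of-envelope sketch of those per-package figures, the snippet below tallies the two memory pools and estimates how long the CPU-GPU link would take to stream the full HBM3 pool. The constant names are my own, units are decimal (GB = 10^9 bytes), and the numbers are simply those quoted above, not an official spec.

```python
# Back-of-envelope sketch of one Grace Hopper Superchip, using the
# figures quoted in the article (decimal units; names are illustrative).
HBM3_GB = 96           # GPU-attached HBM3
LPDDR5X_GB = 512       # CPU-attached LPDDR5X
NVLINK_C2C_TBPS = 1.0  # quoted CPU<->GPU chip-to-chip throughput, TB/s

package_memory_gb = HBM3_GB + LPDDR5X_GB  # total memory per package

# Time to stream the entire HBM3 pool across the NVLink-C2C link once:
hbm_sweep_s = HBM3_GB / (NVLINK_C2C_TBPS * 1000)

print(f"Per-package memory: {package_memory_gb} GB")
print(f"Full HBM3 sweep over NVLink-C2C: {hbm_sweep_s:.3f} s")
```

Under these assumptions the package totals 608GB of directly attached memory, and the GPU's entire HBM3 pool could be refilled from CPU memory in under a tenth of a second, which is the kind of figure that matters for the memory-bound workloads mentioned above.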

Nvidia DGX GH200 Supercomputer

Nvidia's DGX systems are its go-to systems and reference architecture for the most demanding AI and HPC workloads, but the current DGX A100 systems are limited to eight A100 GPUs working in tandem as one cohesive unit. Given the explosion of generative AI, Nvidia's customers are eager for much larger systems with much more performance. The DGX GH200 is designed to offer the ultimate in throughput and massive scalability for the largest workloads, such as generative AI training, large language models, recommender systems, and data analytics, by sidestepping the limitations of standard cluster connectivity options, like InfiniBand and Ethernet, with Nvidia's custom NVLink Switch silicon.

Details are still scant on the finer points of the new DGX GH200 AI supercomputer, but we do know that Nvidia uses a new NVLink Switch System with 36 NVLink switches to tie together 256 GH200 Grace Hopper chips and 144TB of shared memory into one cohesive unit that looks and acts like one massive GPU. The new NVLink Switch System is based on Nvidia's NVLink Switch silicon, now in its third generation.

The DGX GH200 comes with 256 total Grace Hopper CPU+GPUs, easily outstripping Nvidia's previous largest NVLink-connected DGX arrangement with eight GPUs, and the 144TB of shared memory is roughly 500X more than the DGX A100 systems that offer a 'mere' 320GB of shared memory between eight A100 GPUs. Additionally, expanding the DGX A100 system to clusters with more than eight GPUs requires employing InfiniBand as the interconnect between systems, which incurs performance penalties. In contrast, the DGX GH200 marks the first time Nvidia has built an entire supercomputer cluster around the NVLink Switch topology, which Nvidia says provides up to 10X the GPU-to-GPU and 7X the CPU-to-GPU bandwidth of its previous-gen system. It's also designed to provide 5X the interconnect power efficiency (likely measured in pJ/bit) of competing interconnects, and up to 128 TB/s of bisection bandwidth.
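Those scale claims can be sanity-checked with simple division. The sketch below uses only the figures quoted in the article (decimal units; constant names are my own), so treat it as a rough check rather than an official comparison.

```python
# Rough scale comparison of DGX GH200 vs. DGX A100, using only the
# article's figures (decimal units; names are illustrative).
GH200_CHIPS = 256           # Grace Hopper Superchips per DGX GH200
DGX_A100_GPUS = 8           # A100 GPUs per DGX A100
DGX_GH200_SHARED_TB = 144   # quoted shared memory, TB
DGX_A100_SHARED_GB = 320    # quoted shared memory, GB

gpu_ratio = GH200_CHIPS / DGX_A100_GPUS
memory_ratio = DGX_GH200_SHARED_TB * 1000 / DGX_A100_SHARED_GB

print(f"GPU count ratio:     {gpu_ratio:.0f}x")
print(f"Shared-memory ratio: {memory_ratio:.0f}x")
```

Straight decimal division gives a 32X jump in GPU count and about a 450X jump in shared memory; the headline "roughly 500X" figure presumably reflects rounding and unit conventions on Nvidia's side.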

The system has 150 miles of optical fiber and weighs 40,000 lbs, but presents itself as a single GPU. Nvidia says the 256 Grace Hopper Superchips propel the DGX GH200 to one exaflop of 'AI performance,' meaning that value is measured with smaller data types that are more relevant to AI workloads than the FP64 measurements used in HPC and supercomputing. This performance comes courtesy of 900 GB/s of GPU-to-GPU bandwidth, which is quite impressive scaling given that Grace Hopper tops out at 1 TB/s of throughput between the Grace CPU and the GPU when connected directly together on the same board with the NVLink-C2C chip interconnect....
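The one-exaflop claim can also be divided back out to an implied per-superchip rate. This is a hedged back-of-envelope calculation from the article's figures only (constant names are my own), but the result lands in the same ballpark as Hopper's published low-precision Tensor Core throughput, which suggests the system number is a straightforward sum of chip-level peaks.

```python
# Implied per-superchip throughput behind the 'one exaflop of AI
# performance' claim (low-precision data types, per the article).
SYSTEM_AI_FLOPS = 1e18  # 1 exaflop for the full DGX GH200
CHIPS = 256             # Grace Hopper Superchips in the system

per_chip_pflops = SYSTEM_AI_FLOPS / CHIPS / 1e15
print(f"Implied per-superchip rate: {per_chip_pflops:.1f} PFLOPS")
```

Dividing evenly yields roughly 3.9 petaflops per Grace Hopper Superchip in these smaller 'AI' data types, far above what any single chip can reach at FP64, which is why the exaflop figure should not be compared directly against FP64-based supercomputer rankings.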


If interested, our last look at Admiral Grace Hopper was in May 10's "Bananas: Diversity Equity Inclusion".