Thursday, May 11, 2023

Do Your AI Training In The Cloud: Google and Nvidia Team Up (GOOG; NVDA)

From TechCrunch, May 11:

Google Cloud announces new A3 supercomputer VMs built to power LLMs

As we’ve seen LLMs and generative AI come screaming into our consciousness in recent months, it’s clear that these models take enormous amounts of compute power to train and run. Recognizing this, Google Cloud announced a new A3 supercomputer virtual machine today at Google I/O.

The A3 has been purpose-built to handle the considerable demands of these resource-hungry use cases.

“A3 GPU VMs were purpose-built to deliver the highest-performance training for today’s ML workloads, complete with modern CPU, improved host memory, next-generation Nvidia GPUs and major network upgrades,” the company wrote in an announcement.

Specifically, the company is arming these machines with Nvidia’s H100 GPUs and combining that with a specialized data center to derive immense computational power with high throughput and low latency, all at what they suggest is a more reasonable price point than you would typically pay for such a package.

If you’re looking for specs, consider it’s powered by 8 Nvidia H100 GPUs, 4th Gen Intel Xeon Scalable processors, 2TB of host memory and 3.6 TBs bisectional bandwidth between the 8 GPUs via NVSwitch and NVLink 4.0, two Nvidia technologies designed to help maximize throughput between multiple GPUs like the ones in this product.

These machines can provide up to 26 exaFlops of power, which should help improve the time and cost related to training larger machine learning models. What’s more, the workloads on these VMs run in Google’s specialized Jupiter data center networking fabric, which the company describes as, “26,000 highly interconnected GPUs.”....

....MORE

This is massively brute force, supercomputer style.