Tuesday, September 3, 2024

Chips: The Pivot From Training To Inference (where Nvidia's dominance isn't as strong)

Elon Musk tweeted earlier today that xAI's supercomputer was built in under 17 weeks, just incredible. However...there's always a "however"...this means the chips have been purchased, the die has been cast, etc.

See PC Mag, September 3: "Musk's xAI Supercomputer Goes Online With 100,000 Nvidia GPUs".

Additionally, the big buyers (AMZN, META, MSFT) will be transitioning their compute from training to inference, and it is here that we join the electronic geniuses at IEEE Spectrum, August 28:

AI Inference Competition Heats Up
First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI

While the dominance of Nvidia GPUs for AI training remains undisputed, we may be seeing early signs that, for AI inference, the competition is gaining on the tech giant, particularly in terms of power efficiency. The sheer performance of Nvidia’s new Blackwell chip, however, may be hard to beat.

This morning, MLCommons released the results of its latest AI inferencing competition, MLPerf Inference v4.1. This round included first-time submissions from teams using AMD Instinct accelerators, the latest Google Trillium accelerators, chips from Toronto-based startup Untether AI, as well as a first trial for Nvidia’s new Blackwell chip. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit to MLPerf.

Much like an Olympic sport, MLPerf has many categories and subcategories. The one that saw the biggest number of submissions was the “datacenter-closed” category. The closed category (as opposed to open) requires submitters to run inference on a given model as-is, without significant software modification. The data center category tests submitters on bulk processing of queries, as opposed to the edge category, where minimizing latency is the focus.

Within each category, there are 9 different benchmarks, for different types of AI tasks. These include popular use cases such as image generation (think Midjourney) and LLM Q&A (think ChatGPT), as well as equally important but less heralded tasks such as image classification, object detection, and recommendation engines.

This round of the competition included a new benchmark, called Mixture of Experts. This is a growing trend in LLM deployment, where a language model is broken up into several smaller, independent language models, each fine-tuned for a particular task, such as regular conversation, solving math problems, and assisting with coding. The model can direct each query to an appropriate subset of the smaller models, or “experts”. This approach allows for less resource use per query, enabling lower cost and higher throughput, says Miroslav Hodak, MLPerf Inference Workgroup Chair and senior member of technical staff at AMD.
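[A quick aside from your blogger, not part of the Spectrum article: here is a toy Python sketch of that routing idea. The expert names, the keyword-based gate, and the top-1 routing rule are purely illustrative assumptions; in a real deployment the gate is a small learned network and the experts are sub-models of the LLM.]

# Toy mixture-of-experts router. Everything here is an illustrative
# assumption: the expert names, the keyword-based gate, and the top-1
# routing rule are stand-ins, not details from MLPerf or any vendor.

EXPERTS = {
    "chat":   lambda q: f"[chat expert] responding to: {q}",
    "math":   lambda q: f"[math expert] working on: {q}",
    "coding": lambda q: f"[coding expert] drafting code for: {q}",
}

def gate(query: str) -> str:
    """Pick the expert for a query.

    A real gating network is a small learned model that scores every
    expert; keyword matching here just keeps the example runnable.
    """
    q = query.lower()
    if any(tok in q for tok in ("solve", "integral", "equation")):
        return "math"
    if any(tok in q for tok in ("python", "function", "bug")):
        return "coding"
    return "chat"

def answer(query: str) -> str:
    # Only the selected expert runs, which is why per-query compute,
    # and therefore cost, drops relative to one monolithic model.
    return EXPERTS[gate(query)](query)

if __name__ == "__main__":
    for q in ("Solve 3x + 1 = 10",
              "Write a Python function to reverse a list",
              "How was your weekend?"):
        print(answer(q))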

The winners on each benchmark within the popular datacenter-closed benchmark were still submissions based on Nvidia’s H200 GPUs and GH200 superchips, which combine GPUs and CPUs in the same package. However, a closer look at the performance results paints a more complex picture. Some of the submitters used many accelerator chips while others used just one. If we normalize the number of queries per second each submitter was able to handle by the number of accelerators used, and keep only the best performing submissions for each accelerator type, some interesting details emerge. (It’s important to note that this approach ignores the role of CPUs and interconnects.)....

....MUCH MORE
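For anyone who wants to redo Spectrum's per-accelerator math on the published numbers, the arithmetic is straightforward: divide each submission's queries per second by its accelerator count and keep the best figure for each chip type. A minimal Python sketch with made-up numbers (and, as the article cautions, this ignores CPUs and interconnects):

# Sketch of the per-accelerator normalization described above: divide each
# submission's total queries per second by its accelerator count, then keep
# the best result per accelerator type. The numbers are invented for
# illustration; they are not actual MLPerf v4.1 results.
from collections import defaultdict

# (accelerator type, accelerator count, total queries per second)
submissions = [
    ("H200",       8, 28_000.0),
    ("H200",       1,  3_600.0),
    ("GH200",      1,  3_900.0),
    ("Example-X",  4, 12_500.0),
]

best_per_chip = defaultdict(float)
for chip, n_accel, qps in submissions:
    per_accel = qps / n_accel                      # normalize by chip count
    best_per_chip[chip] = max(best_per_chip[chip], per_accel)

for chip, qps in sorted(best_per_chip.items(), key=lambda kv: -kv[1]):
    print(f"{chip:10s} {qps:9.1f} queries/s per accelerator")

Plug the actual MLPerf v4.1 result tables into that loop and you get the per-chip comparison Spectrum is describing.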

One point not mentioned in the Spectrum story (because the chip wasn't submitted) is that Cerebras Systems (remember the chip as big as your head?) has an architectural advantage in the inference games. If interested, see siliconANGLE, August 27:

Cerebras Systems throws down gauntlet to Nvidia with launch of ‘world’s fastest’ AI inference service 

Ambitious artificial intelligence computing startup Cerebras Systems Inc. is raising the stakes in its battle against Nvidia Corp., launching what it says is the world’s fastest AI inference service, and it’s available now in the cloud.

AI inference refers to the process of running live data through a trained AI model to make a prediction or solve a task. Inference services are the workhorse of the AI industry, and according to Cerebras, it’s the fastest-growing segment too, accounting for about 40% of all AI workloads in the cloud today.

However, existing AI inference services don’t appear to satisfy the needs of every customer. “We’re seeing all sorts of interest in how to get inference done faster and for less money,” Chief Executive Andrew Feldman told a gathering of reporters in San Francisco Monday.

The company intends to deliver on this with its new “high-speed inference” services. It believes the launch is a watershed moment for the AI industry, saying the 1,000-tokens-per-second speeds it can deliver are comparable to the introduction of broadband internet, enabling game-changing new opportunities for AI applications.

Raw power...

....MUCH MORE
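About that 1,000-tokens-per-second figure: throughput claims of this sort are just generated tokens divided by wall-clock time, so they are easy to sanity-check yourself. A minimal Python sketch with a stand-in generator (not Cerebras' or anyone else's actual API):

# How a tokens-per-second figure is typically measured: count the tokens
# generated and divide by elapsed wall-clock time. generate_tokens() below
# is a placeholder that simulates per-token latency, not a real model call.
import time

def generate_tokens(prompt: str, n_tokens: int = 256):
    """Stand-in for a real model call; a deployment streams tokens instead."""
    for i in range(n_tokens):
        time.sleep(0.001)              # simulate per-token latency
        yield f"tok{i}"

start = time.perf_counter()
tokens = list(generate_tokens("Explain AI inference in one paragraph."))
elapsed = time.perf_counter() - start

print(f"{len(tokens)} tokens in {elapsed:.2f} s "
      f"-> {len(tokens) / elapsed:,.0f} tokens/s")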

Nvidia is well aware of Cerebras and has been quietly working on its own inference speeds, but should all else fail, NVDA has $34.8 billion in cash and equivalents, friendly bankers, and stock that seems pretty darn fungible with U.S. dollars.

If you catch my drift.