Monday, October 16, 2023

Chips/Supercomputers: "Tachyum says someone will build 50 exaFLOPS super with its as-yet unfinished chips"

From The Register, October 16:

'It's a huge, effing big machine'

Interview: Tachyum's first chip Prodigy hasn't even taped out - let alone gone into mass production - but one customer has, we're told, committed to buying hundreds of thousands of the processors to power a massive 50 exaFLOPS supercomputer.

Usually when we see numbers like that, the obvious assumption is they're talking about AI FLOPS using 8- or 16-bit floating-point math precision, not the 64-bit double-precision calculations typically used in high-performance computing. But Tachyum claimed the system will be capable of 25x the performance of "the world's fastest conventional supercomputer built just this year."

This appears to be a reference to the newly inaugurated Aurora supercomputer at Argonne National Labs, which boasts more than two exaFLOPS of peak FP64 performance.

If Tachyum's claim wasn't already wild enough, the processor designer claims the forthcoming system will be capable of eight zettaFLOPS of AI performance for large language models and will boast hundreds of petabytes of DDR5 memory when it's completed in 2025.

A paper Prodigy
To fully understand the scope of what Tachyum is planning we need to take a closer look at the chip the company has spent the past few years developing and redeveloping.

Tachyum describes Prodigy as a universal processor. As the name suggests, this isn't some specialized chip designed solely to accelerate AI or HPC workloads. It's intended as a general-purpose component that can run any workload you might throw at it. The emulator QEMU has been ported to Prodigy's architecture to run today's x86, Arm, and RISC-V code.

It's "a CPU which integrates AI and HPC for free. That's our story," CEO Radoslav Danilak told The Register.

Taking a look at Tachyum's latest renders, we can see most of Prodigy's 600mm² die is dedicated to its 192 64-bit processor cores. So it'll also be a decently large chip, but not as big as Nvidia's GH100 at 814mm². The cores feature a custom instruction set architecture and will execute four out-of-order instructions per clock cycle at frequencies in excess of 5 GHz, according to the data sheet [PDF] at least.

To appeal to the Big Iron markets, the cores will feature double 1024-bit vector processing capabilities with native support for matrix math. This, Danilak said, increases the "amortization of the CPU overhead fetch, decode, scheduling, and so on compared to data path by order of magnitude."

According to Tachyum, these cores are also fast, with the top-specced part supposedly capable of 90 teraFLOPS of FP64 performance and 12 petaFLOPS of FP8 with sparsity. By employing greater degrees of sparsity the upstart biz boasts that the chip will be capable of 48 petaFLOPS.
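
As a rough sanity check on those figures, the short Python sketch below divides the claimed system totals (50 exaFLOPS of FP64, eight zettaFLOPS of AI performance) by the claimed per-chip peaks. It assumes perfect scaling and that the system totals are simple sums of per-chip peak numbers, which the article does not confirm; the variable and function names are illustrative only.

```python
# Back-of-the-envelope check using only figures quoted in the article.
# Assumption: system totals are plain sums of per-chip peak throughput.
TERA = 10**12
PETA = 10**15
EXA = 10**18
ZETTA = 10**21

system_fp64 = 50 * EXA        # claimed FP64 system performance
system_ai = 8 * ZETTA         # claimed AI system performance

chip_fp64 = 90 * TERA         # claimed per-chip FP64 peak
chip_fp8_sparse = 12 * PETA   # claimed per-chip FP8 peak with sparsity
chip_fp8_max = 48 * PETA      # claimed per-chip peak with more sparsity


def chips_needed(system_flops: int, chip_flops: int) -> int:
    """Smallest whole number of chips whose summed peak meets the target."""
    return -(-system_flops // chip_flops)  # integer ceiling division


fp64_chips = chips_needed(system_fp64, chip_fp64)
ai_chips_12pf = chips_needed(system_ai, chip_fp8_sparse)
ai_chips_48pf = chips_needed(system_ai, chip_fp8_max)

print(f"Chips implied by 50 exaFLOPS FP64:      {fp64_chips:,}")
print(f"Chips implied by 8 zettaFLOPS at 12 PF: {ai_chips_12pf:,}")
print(f"Chips implied by 8 zettaFLOPS at 48 PF: {ai_chips_48pf:,}")
```

Under these assumptions each estimate lands in the hundreds of thousands of chips, which is at least consistent with the order size described earlier in the piece.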

"We will be showing, in the next 30 days — publishing a paper — these measurements from these industry standard benchmarks, where we basically achieve, including training, two bits per weight," Danilak said.

To put those numbers in perspective, Tachyum is essentially saying its chip is going to deliver 3x the performance of Nvidia's H100 SXM modules in both HPC and AI workloads. Though - we'll note that given the timeline - the H100 more than likely won't be the chip Prodigy has to contend with; we expect Nvidia to announce its next-gen accelerator next spring....
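
The "3x" comparison can be roughly reproduced with a few divisions. The sketch below is not from the article: the H100 SXM numbers are assumptions taken from Nvidia's published data sheet (34 teraFLOPS FP64, 67 teraFLOPS FP64 Tensor Core, 3,958 teraFLOPS FP8 with sparsity), and the Prodigy numbers are Tachyum's unverified claims quoted above.

```python
# Peak-throughput ratios, Prodigy (claimed) vs. H100 SXM.
# All values in teraFLOPS.
prodigy_fp64 = 90             # Tachyum's claim, per the article
prodigy_fp8_sparse = 12_000   # 12 petaFLOPS, per the article

# Assumption: H100 SXM figures as listed on Nvidia's data sheet.
h100_fp64_vector = 34
h100_fp64_tensor = 67
h100_fp8_sparse = 3_958

fp64_vector_ratio = prodigy_fp64 / h100_fp64_vector
fp64_tensor_ratio = prodigy_fp64 / h100_fp64_tensor
fp8_ratio = prodigy_fp8_sparse / h100_fp8_sparse

print(f"FP64 vs. H100 vector FP64: {fp64_vector_ratio:.2f}x")
print(f"FP64 vs. H100 tensor FP64: {fp64_tensor_ratio:.2f}x")
print(f"FP8 with sparsity:         {fp8_ratio:.2f}x")
```

On these assumed figures the FP8 ratio comes out close to 3x, while the FP64 ratio depends on which H100 number is used as the baseline.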

....MUCH MORE