Monday, April 2, 2018

Some Of NVIDIA's Chips Are Getting the Wrong Answer For Math Problems (NVDA)

This doesn't sound as earthshaking as the Intel Pentium FDIV bug, which was discovered in 1994, but it is troubling.
From The Register:

 2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations
Fine for gaming, not so much for modeling, it is claimed
Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists.

The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nv's Volta technology. Gamers and casual users will not notice any errors or issues; however, folks running intensive scientific software may encounter occasional glitches.

One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we're told.
We have repeatedly asked Nvidia for an explanation, and spokespeople have declined to comment. With Nvidia kicking off its GPU Technology Conference in San Jose, California, next week, perhaps then we'll get some answers.

All in all, it is bad news for boffins as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions within the virtual world, not rare glitches in the underlying hardware.

Collisions
Take for instance software that models molecular interactions. This sort of code uses Newtonian equations to predict the state of a system at any given time, such as calculating the position of particles after collisions. If a simulation has the same environment and starts with the same conditions, the output should be the same, again and again. But that isn’t always the case when using Nvidia’s Titan V GPUs to crunch the numbers.
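To make the reproducibility claim concrete, here is a minimal sketch of that kind of check, run on a CPU in plain Python. It is not the engineer's actual test: the two-particle spring system, the function names (`simulate`, `fingerprint`, `count_mismatches`) and all parameter values are assumptions chosen for illustration. The point is only that a Newtonian integrator given the same starting conditions should return bit-for-bit identical results on every run.

```python
import hashlib
import struct


def simulate(steps=1000, dt=0.01):
    """Velocity-Verlet integration of two particles joined by a spring (1-D).

    Hypothetical toy system: all constants are illustrative assumptions.
    """
    x = [0.0, 1.5]              # initial positions
    v = [0.0, 0.0]              # initial velocities
    k, rest, m = 10.0, 1.0, 1.0  # spring constant, rest length, mass

    def forces(pos):
        # Force on particle 1 pulls it back toward the rest length;
        # particle 0 feels the equal and opposite force.
        f1 = -k * ((pos[1] - pos[0]) - rest)
        return [-f1, f1]

    f = forces(x)
    for _ in range(steps):
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, f)]  # half kick
        x = [xi + dt * vi for xi, vi in zip(x, v)]            # drift
        f = forces(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, f)]  # half kick
    return x + v


def fingerprint(state):
    """Hash the exact bit patterns so even a one-bit difference shows up."""
    return hashlib.sha256(struct.pack(f"<{len(state)}d", *state)).hexdigest()


def count_mismatches(runs=20):
    """Re-run the identical simulation and count runs that differ from the first."""
    reference = fingerprint(simulate())
    return sum(fingerprint(simulate()) != reference for _ in range(runs))
```

On healthy hardware `count_mismatches()` should come back as zero. A non-zero count for a deterministic, single-threaded calculation like this one would point at the machine rather than the model, which is the kind of symptom described for the Titan V.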

An industry veteran, who alerted us to the issue, reckoned this is due to a memory issue. Chip companies normally push their high-end silicon to the limit to maximize performance. Nvidia may be overclocking or red-lining its Titan V in some way, causing read errors from memory. These mistakes are carried forward in calculations, resulting in numerical errors. Another cause could be a design blunder.
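The idea that a bad memory read gets carried forward can be illustrated with a small sketch. This says nothing about what actually happens inside a Titan V, since the memory theory above is the source's speculation. The helper names (`flip_bit`, `running_sum`), the choice of bit 40 and the input data are all assumptions made up for the example. It flips a single bit in one value as it is "read" and shows that the final result shifts.

```python
import struct


def flip_bit(value, bit):
    """Return value with one bit of its 64-bit IEEE 754 representation inverted."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def running_sum(values, corrupt_at=None, bit=40):
    """Add up values; optionally corrupt one of them to mimic a faulty read."""
    total = 0.0
    for i, v in enumerate(values):
        if i == corrupt_at:
            v = flip_bit(v, bit)  # simulated single-bit read error
        total += v
    return total


values = [0.1 * i for i in range(1, 1001)]
clean = running_sum(values)
corrupted = running_sum(values, corrupt_at=10)
difference = abs(clean - corrupted)  # non-zero: the bad read survives to the result
```

Here one flipped mantissa bit in a single operand is enough to change the total. In a long-running simulation, where each step feeds the next, a discrepancy like this would not stay confined to one number.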

It is not down to random defects in the chipsets nor a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told. The moneybags biz released patches for some of its older GeForce and Titan models that exhibited similar problems to address these errors. There was no issue with its Titan X card based on its Pascal architecture, we're told....