That's 2 1/2 times faster than Moore's "Law" is/was for chips.
From the electrical wizards at IEEE Spectrum, July 2:
By 2030, LLMs may do a month’s work in just hours
The main purpose of many large language models (LLMs) is to produce compelling text that is as close to human writing as possible. And therein lies a major reason why it’s so hard to gauge the relative performance of LLMs using traditional benchmarks: Quality of writing doesn’t necessarily correlate with metrics traditionally used to measure processor performance, such as instruction execution rate.
But researchers at the Berkeley, Calif., think tank METR (for Model Evaluation & Threat Research) have come up with an ingenious idea. First, identify a series of tasks with varying complexity and record the average time it takes for a group of humans to complete each task. Then have various versions of LLMs complete the same tasks, noting cases in which a version of an LLM successfully completes the task with some level of reliability, say 50 percent of the time. Plots of the resulting data confirm that as time goes on, successive generations of an LLM can reliably complete longer and longer (more and more complex) tasks.
No surprise there. But the shock was that this improvement in the ability of LLMs to reliably complete harder tasks has been exponential, with a doubling period of about seven months.
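The horizon metric behind those plots can be sketched with a toy logistic fit: model a system's success probability as a declining function of (log) task length, then solve for the length at which it succeeds 50 percent of the time. The data below is entirely made up for illustration; the real METR task suite and results are in their paper.

```python
import math

# Hypothetical (task_length_hours, success 0/1) outcomes for one model.
# Short tasks mostly succeed, long tasks mostly fail.
results = [(0.05, 1), (0.1, 1), (0.25, 1), (0.5, 1), (1.0, 0),
           (2.0, 1), (4.0, 0), (8.0, 0), (16.0, 0), (32.0, 0)]

# Fit p(success) = sigmoid(a - b * log2(length)) by gradient ascent
# on the log-likelihood (plain logistic regression, no libraries).
a, b, lr = 0.0, 1.0, 0.1
for _ in range(5000):
    ga = gb = 0.0
    for t, y in results:
        x = math.log2(t)
        p = 1.0 / (1.0 + math.exp(-(a - b * x)))
        ga += (y - p)          # gradient w.r.t. a
        gb += (y - p) * (-x)   # gradient w.r.t. b
    a += lr * ga
    b += lr * gb

# The 50% point is where a - b * log2(t50) = 0, i.e. t50 = 2**(a/b).
t50 = 2 ** (a / b)
print(f"50%-reliability horizon: about {t50:.2f} hours")
```

Tracking how `t50` grows across successive model generations is what yields the doubling-time plot.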
IEEE Spectrum reached out to Megan Kinniment, one of the authors of a METR research paper describing this work and its surprising implications.
Evaluating LLM Performance Metrics
Did you suspect that you’d get these results?
Megan Kinniment: I, at least personally, didn’t expect us to have quite as clear an exponential as we did. Models have definitely been getting better quickly, though. So some fast rate of progress wasn’t entirely unexpected.
As you point out in the paper, it’s always dangerous to look into the future and extrapolate. However, you suggest that there is a likelihood of this continuing, which means that by 2030 we’ll be looking at monthlong tasks being within the capability of the most advanced large language models.
Kinniment: Let’s have a look at that. By one month, we mean around 167 working hours, so the number of [human] working hours in a month. And that’s at 50 percent reliability. But longer tasks typically seem to require higher reliability to actually be useful. So that’s something that could make the in-practice, real-world economic impact less intense than what is predicted.
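The 2030 figure follows from simple arithmetic on the doubling trend. Assuming, purely for illustration, a roughly one-hour 50%-reliability horizon as a starting point in early 2025 (a value not stated in this excerpt):

```python
import math

doubling_months = 7        # doubling period reported by METR
current_horizon_h = 1.0    # assumed ~1-hour horizon today (illustrative)
target_h = 167             # "one month" = 167 human working hours, per the interview

# Number of doublings needed, then elapsed time at 7 months per doubling.
doublings = math.log2(target_h / current_horizon_h)
months = doublings * doubling_months
print(f"{doublings:.1f} doublings, about {months / 12:.1f} years out")
```

About 7.4 doublings, or roughly 4.3 years, which is how an extrapolation from the mid-2020s lands around 2029–2030.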
There are a number of things that would have to continue for this prediction to come true. Hardware would have to continue improving at roughly the rate it’s improving; software would have to keep improving. You would have to have sufficient training data and availability of that training data to continue training at the breathtaking clip that’s been occurring in recent years.
Kinniment: The forecasts and the dates that we’ve found are just extrapolating the trend that we see on our task suite. [The trends are] not taking into account real-world factors or compute-scaling changes.
If a large language model could somehow achieve the ability to complete 167-hour tasks with 50 percent reliability, what kinds of things does that put in the realm of capability for a large language model?
Kinniment: Well, the big one that we often think about is accelerating AI R&D research itself. To the extent that you can make models that accelerate your company’s ability to make better models, you could end up in a situation where AI capabilities develop really quite rapidly....
....MUCH MORE
Also at IEEE Spectrum:
Large Language Models Are Improving Exponentially
....MUCH MORE