Thursday, March 27, 2025

"Anthropic can now track the bizarre inner workings of a large language model"

Excellent. We're getting closer to being able to explain to outside investors why they are losing money on the latest, recently implemented algo-trading tool.

On Saturday, September 23, 2017, at 6:28 AM PDT, we posted "Cracking Open the Black Box of Deep Learning" with this introduction:

One of the spookiest features of black box artificial intelligence is that, when it is working correctly, the AI is making connections and casting probabilities that are difficult-to-impossible for human beings to intuit.
Try explaining that to your outside investors.

You start to sound, to their ears anyway, like a loony who is saying "Etaoin shrdlu, give me your money, gizzlefab, blythfornik, trust me."

See also the famous Gary Larson cartoons on how various animals hear and comprehend:...
Today [9/28/17] Bloomberg View's Matt Levine commends to our attention a story about one of the world's biggest hedge funds and prize-putter-upper of what's probably the most prestigious honor in  literature, short of the Nobel, the Man Booker Award.

On Tuesday, September 26, 2017, at 11:00 PM CDT, Bloomberg posted:
The Massive Hedge Fund Betting on AI

The second paragraph of the story:
...Man Group, which has about $96 billion under management, typically takes its most promising ideas from testing to trading real money within weeks. In the fast-moving world of modern finance, an edge today can be gone tomorrow. The catch here was that, even as the new software produced encouraging returns in simulations, the engineers couldn’t explain why the AI was executing the trades it was making. The creation was such a black box that even its creators didn’t fully understand how it worked. That gave Ellis pause. He’s not an engineer and wasn’t intimately involved in the technology’s creation, but he instinctively knew that one explanation—“I can’t tell you why …”—would never fly with big clients looking for answers when Man inevitably lost some of their money... 

And with that stroll down memory lane behind us, we visit MIT Technology Review, March 27, 2025, for the headliner:

What they found challenges some basic assumptions about how this technology really works.

The AI firm Anthropic has developed a way to peer inside a large language model and watch what it does as it comes up with a response, revealing key new insights into how the technology works. The takeaway: LLMs are even stranger than we thought.

The Anthropic team was surprised by some of the counterintuitive workarounds that large language models appear to use to complete sentences, solve simple math problems, suppress hallucinations, and more, says Joshua Batson, a research scientist at the company.

It’s no secret that large language models work in mysterious ways. Few—if any—mass-market technologies have ever been so little understood. That makes figuring out what makes them tick one of the biggest open challenges in science.

But it’s not just about curiosity. Shedding some light on how these models work would expose their weaknesses, revealing why they make stuff up and can be tricked into going off the rails. It would help resolve deep disputes about exactly what these models can and can’t do. And it would show how trustworthy (or not) they really are.

Batson and his colleagues describe their new work in two reports published today. The first presents Anthropic’s use of a technique called circuit tracing, which lets researchers track the decision-making processes inside a large language model step by step. Anthropic used circuit tracing to watch its LLM Claude 3.5 Haiku carry out various tasks. The second (titled “On the Biology of a Large Language Model”) details what the team discovered when it looked at 10 tasks in particular.

“I think this is really cool work,” says Jack Merullo, who studies large language models at Brown University in Providence, Rhode Island, and was not involved in the research. “It’s a really nice step forward in terms of methods.”

Circuit tracing is not itself new. Last year Merullo and his colleagues analyzed a specific circuit in a version of OpenAI’s GPT-2, an older large language model that OpenAI released in 2019. But Anthropic has now analyzed a number of different circuits as a far larger and far more complex model carries out multiple tasks. “Anthropic is very capable at applying scale to a problem,” says Merullo.

Eden Biran, who studies large language models at Tel Aviv University, agrees. “Finding circuits in a large state-of-the-art model such as Claude is a nontrivial engineering feat,” he says. “And it shows that circuits scale up and might be a good way forward for interpreting language models.”

Circuits chain together different parts—or components—of a model. Last year, Anthropic identified certain components inside Claude that correspond to real-world concepts. Some were specific, such as “Michael Jordan” or “greenness”; others were more vague, such as “conflict between individuals.” One component appeared to represent the Golden Gate Bridge. Anthropic researchers found that if they turned up the dial on this component, Claude could be made to self-identify not as a large language model but as the physical bridge itself....

....MUCH MORE 
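
For readers curious what "turning up the dial" on a component might look like in practice, here is a minimal, hypothetical sketch. It is emphatically not Anthropic's circuit tracing or its learned sparse features, and it uses GPT-2 rather than Claude; the layer index, the scale, and the random "concept direction" are made-up stand-ins, purely to show where such an intervention would be injected into a model's residual stream.

```python
# Toy activation-steering sketch (assumptions: GPT-2, an arbitrary layer,
# and a random vector standing in for a real learned concept direction).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block to intervene on (arbitrary choice)
SCALE = 8.0  # the "dial": how strongly to push along the concept direction

# Hypothetical concept direction. In real interpretability work this would
# come from a learned feature; here it is random, for illustration only.
hidden_size = model.config.hidden_size
concept_direction = torch.randn(hidden_size)
concept_direction = concept_direction / concept_direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states,
    # shape (batch, sequence, hidden_size). Add the scaled direction to
    # every token position, leave the rest of the tuple untouched.
    hidden = output[0] + SCALE * concept_direction
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)

prompt = "Tell me about yourself."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore normal behavior
```

With a real feature direction and the right scale, this kind of intervention is how researchers get a model to obsess over, say, a certain bridge; with a random vector, expect nothing more interesting than degraded output.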

I'll note right now it's only with Levine that you get:

"I imagine a leather-clad dominatrix standing over the computer, ready to administer punishment as necessary."