Friday, February 2, 2024

"ChatGPT is a black box: how AI research can break it open"

 From the journal Nature, July 25, 2023:

Despite their wide use, large language models are still mysterious. Revealing their true nature is urgent and important. 

“I propose to consider the question, ‘Can machines think?’” So began a seminal 1950 paper by British computing and mathematics luminary Alan Turing (A. M. Turing Mind LIX, 433–460; 1950).

But as an alternative to the thorny task of defining what it means to think, Turing proposed a scenario that he called the “imitation game”. A person, called the interrogator, has text-based conversations with other people and a computer. Turing wondered whether the interrogator could reliably detect the computer — and implied that if they could not, then the computer could be presumed to be thinking. The game captured the public’s imagination and became known as the Turing test.

Although an enduring idea, the test has largely been considered too vague — and too focused on deception, rather than genuinely intelligent behaviour — to be a serious research tool or goal for artificial intelligence (AI). But the question of what part language can play in evaluating and creating intelligence is more relevant today than ever. That’s thanks to the explosion in the capabilities of AI systems known as large language models (LLMs), which are behind the ChatGPT chatbot, made by the firm OpenAI in San Francisco, California, and other advanced bots, such as Microsoft’s Bing Chat and Google’s Bard. As the name ‘large language model’ suggests, these tools are based purely on language.

With an eerily human, sometimes delightful knack for conversation — as well as a litany of other capabilities, including essay and poem writing, coding, passing tough exams and text summarization — these bots have triggered both excitement and fear about AI and what its rise means for humanity. But underlying these impressive achievements is a burning question: how do LLMs work? As with other neural networks, many of the behaviours of LLMs emerge from a training process, rather than being specified by programmers. As a result, in many cases the precise reasons why LLMs behave the way they do, as well as the mechanisms that underpin their behaviour, are not known — even to their own creators.

As Nature reports in a Feature, scientists are piecing together both LLMs’ true capabilities and the underlying mechanisms that drive them. Michael Frank, a cognitive scientist at Stanford University in California, describes the task as similar to investigating an “alien intelligence”.

Revealing how these models work is both urgent and important, as researchers have pointed out (S. Bubeck et al. Preprint at https://arxiv.org/abs/2303.12712; 2023). For LLMs to solve problems and increase productivity in fields such as medicine and law, people need to better understand both the successes and failures of these tools. This will require new tests that offer a more systematic assessment than those that exist today.

Breezing through exams
LLMs ingest enormous reams of text, which they use to learn to predict the next word in a sentence or conversation. The models adjust their outputs through trial and error, and these can be further refined by feedback from human trainers. This seemingly simple process can have powerful results. Unlike previous AI systems, which were specialized to perform one task or have one capability, LLMs breeze through exams and questions with a breadth that would have seemed unthinkable for a single system just a few years ago.
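To make that training objective concrete, here is a deliberately tiny sketch of next-word prediction in PyTorch. The toy corpus, the small recurrent model and every dimension are illustrative assumptions; production LLMs use transformer architectures, subword tokenizers and vastly more data and compute, but the predict-and-correct loop is the same basic idea.

```python
# A toy illustration of next-token prediction, the core training objective
# described above. Everything here (corpus, sizes, model) is illustrative;
# real LLMs use transformers and vastly more data and compute.
import torch
import torch.nn as nn

# Tiny "corpus" and whitespace tokenizer (real systems use subword tokenizers).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in corpus])

# Training pairs: each token is used to predict the token that follows it.
inputs, targets = ids[:-1], ids[1:]

class TinyLM(nn.Module):
    """A one-layer next-token predictor: embedding -> GRU -> vocabulary logits."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)          # logits over the vocabulary at each position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()      # penalises wrong next-word guesses

# "Trial and error": predict, measure the error, adjust the weights, repeat.
for step in range(200):
    logits = model(inputs.unsqueeze(0)).squeeze(0)
    loss = loss_fn(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, ask the model what tends to follow "the".
with torch.no_grad():
    logits = model(torch.tensor([[stoi["the"]]])).squeeze()
    print(vocab[int(logits.argmax())])   # most likely next word after "the"
```

The point of the sketch is only that nothing task-specific is programmed in: the model is adjusted to reduce its next-word prediction error, and whatever broader behaviour emerges comes out of that single objective applied at enormous scale.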

But as researchers are increasingly documenting, LLMs’ capabilities can be brittle. Although GPT-4, the most advanced version of the LLM behind ChatGPT, has aced some academic and professional exam questions, even small perturbations to the way a question is phrased can throw the models off. This lack of robustness signals unreliability in the real world.
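One way to probe that brittleness is to re-ask the same question under small rewordings and check whether the answer survives. The sketch below is a minimal, hypothetical harness: ask_model is a placeholder for whatever chat API is being tested, and the paraphrases and consistency score are illustrative rather than any published benchmark.

```python
# A minimal sketch of a phrasing-robustness check: the same question is posed
# in several paraphrased forms, and we measure how often the answers agree.
# `ask_model` is a hypothetical stand-in for a call to the LLM under test.
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder: replace with a real API call to the model being evaluated."""
    raise NotImplementedError

def robustness_check(paraphrases: list[str]) -> float:
    """Return the fraction of paraphrases whose answer matches the most common one."""
    answers = [ask_model(p).strip().lower() for p in paraphrases]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Hypothetical rewordings of one exam-style question.
paraphrases = [
    "What is the capital of Australia?",
    "Name the capital city of Australia.",
    "Australia's capital is which city?",
    "Which city serves as the capital of Australia?",
]

# A score near 1.0 means the model answers consistently despite rewording;
# lower scores reflect the brittleness discussed above.
# print(robustness_check(paraphrases))   # requires a real ask_model implementation
```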

Scientists are now debating what is going on under the hood of LLMs, given this mixed performance. On one side are researchers who see glimmers of reasoning and understanding when the models succeed at some tests. On the other are those who see their unreliability as a sign that the models are not as smart as they seem....

....MUCH MORE