LLMs are flawed like you

OpenAI’s ChatGPT and Google’s Bard compete to be the Large Language Model (LLM) of choice in a market eager to adopt the shiny new technology. However, the tech race has started a never-ending game of two truths and a lie. The models are trained on massive datasets assembled from the internet, including high-quality content, as well as the mediocre, inaccurate, and sometimes irrational stuff you and I write in Facebook posts and Reddit comments. There’s probably more of the latter.

Neural networks

LLMs are artificial neural networks that learn to recognise patterns from data. The quality of that data is crucial because it shapes the responses. For example, one skin cancer detection tool was trained on images of skin lesions labelled as cancerous or not cancerous. Impressively, it became as accurate as human dermatologists at diagnosing skin cancer. However, closer inspection revealed that the algorithm simply looked for the presence of a ruler in the pictures to arrive at its conclusions. Dermatologists typically photograph a ruler alongside lesions they consider suspect in order to record their size, so rulers appeared far more often in the cancerous images.

So the tool is clever, just not in the way we expected. The algorithm found a shortcut: using the ruler as an indicator is easier than telling the difference between types of skin lesions. People also use shortcuts, or heuristics, when assessing new information, especially when they don’t take the time to apply critical thinking.
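To make the idea of a shortcut concrete, here is a minimal, hypothetical sketch (the data, feature names and numbers are invented for illustration; this is not the actual dermatology model). A simple classifier trained on data in which a ‘ruler present’ flag happens to correlate with the label will lean on that flag rather than on the weak lesion signal:

```python
# Toy illustration of shortcut learning, not the real dermatology study.
# 'ruler_present' is a spurious feature that correlates with the label,
# so the classifier relies on it instead of the noisy lesion evidence.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
malignant = rng.integers(0, 2, n)                    # ground-truth labels
lesion_signal = malignant + rng.normal(0, 2.0, n)    # weak, noisy 'real' evidence
ruler_present = ((malignant == 1) & (rng.random(n) < 0.9)).astype(float)  # rulers mostly photographed with suspect lesions

X = np.column_stack([lesion_signal, ruler_present])
model = LogisticRegression().fit(X, malignant)

print("coefficients [lesion_signal, ruler_present]:", model.coef_[0])
# The ruler coefficient dominates: the model 'diagnoses' the ruler, not the lesion.
```

Nothing in the training objective tells the model which feature is medically meaningful; it simply latches onto whichever cue predicts the label most cheaply.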

Critical thinking

Critical thinking isn’t always necessary or beneficial. In his book Thinking, Fast and Slow, Daniel Kahneman describes two modes of thought: System 1 and System 2. System 1 describes our fast and instinctive decisions, like changing gears or tying shoelaces. You don’t need to ponder these automated choices. System 2, on the other hand, is deliberative and logical. We use it for complicated problems, such as figuring out how much money to put into a savings account. Sometimes we use System 1 when, really, we should be using System 2. Like when we’re commenting, liking, or sharing things online. LLMs operate in System 1 exclusively.

So ChatGPT and Bard are systems that tend to take shortcuts, trained on data produced by people who also tend to take shortcuts. That means they should be used with caution. Because LLMs operate in System 1, their responses are just as flawed and biased as the average person’s mindless comments. But the text is better presented and therefore comes across as credible.

LLMs do not actually understand the meaning of the text they produce. Instead, they ‘parrot’ back the patterns they learnt from the data they were trained on. Emily Bender, Timnit Gebru, Angelina McMillan-Major and Shmargaret Shmitchell coined the term ‘stochastic parrots’ to capture that idea. And, yes, that’s the same Timnit Gebru who coined the acronym TESCREAL. But LLMs don’t just parrot back words. They also parrot back human cognitive biases.

Biases and heuristics

Alaina Talboy and Elizabeth Fuller challenge the appearance of machine intelligence by evaluating ChatGPT and Bard for well-known biases and heuristics. In one test, they check for the representativeness heuristic using a classic description by Tversky and Kahneman. You may have come across it before:

“Steve is very shy and withdrawn, invariably helpful, but with little interest in people, or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail. Order the probability of Steve being in each of the following occupations: farmer, salesman, airline pilot, librarian, and middle school teacher.”

Talboy and Fuller, Challenging the appearance of machine intelligence: Cognitive bias in LLMs

Most people think Steve is a librarian. The social stereotype fits. But the description of his traits alone doesn’t provide enough information to make an educated guess, and it ignores base rates entirely: far more people work as farmers or teachers than as librarians. Humans make this mistake, so ChatGPT and Bard do too.
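One way to see why the stereotype answer is shaky is a back-of-the-envelope Bayes calculation. The workforce sizes and ‘how well the description fits’ probabilities below are entirely made up for illustration; the point is only that base rates can swamp even a strong stereotype match:

```python
# Toy Bayes calculation with invented numbers: even if the 'shy and tidy'
# description fits librarians best, the far larger number of farmers and
# teachers drags the librarian posterior back down.
priors = {                      # hypothetical workforce sizes
    "farmer": 2_000_000,
    "salesman": 1_500_000,
    "airline pilot": 100_000,
    "librarian": 150_000,
    "middle school teacher": 1_000_000,
}
fit = {                         # hypothetical P(description | occupation)
    "farmer": 0.05,
    "salesman": 0.02,
    "airline pilot": 0.05,
    "librarian": 0.40,
    "middle school teacher": 0.10,
}

evidence = sum(priors[o] * fit[o] for o in priors)
for o in priors:
    posterior = priors[o] * fit[o] / evidence
    print(f"{o:22s} P(occupation | description) = {posterior:.2f}")
```

With these invented figures, ‘librarian’ ends up well behind ‘farmer’ and ‘middle school teacher’, which is exactly the base-rate information the stereotype-driven answer throws away.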

But so what? People get it wrong all the time, and the world keeps on turning. Why is it a problem if chatbots get it wrong too? The danger lies in the fact that their responses look logical and credible. Look at the reply ChatGPT generated for me:

Fig 1: ChatGPT response for the representativeness heuristic test

ChatGPT presents what appears to be a reasoned and clear-cut evaluation of Steve’s suitability for each role. But it’s an overly simplistic view printed as fact. When Bob from football blurts out a less eloquent version down the pub, we don’t just accept his opinion. Yet when the computer puts it like that, it looks just right, doesn’t it?

LLMs don’t have to give black-and-white answers, though. Bard complied with the request, ranked the professions and explained its choices in much the same way as ChatGPT. Still, it finished with this final thought:

Fig 2: Bard’s conclusion in the representativeness heuristic test

The example of the representativeness heuristic seems relatively harmless. And in isolation, it probably is. But the problem is that the biases are pervasive, and putting up a disclaimer or two is not enough. If you look closely at Fig 2, you can see that at the bottom, in small print, it reads, ‘Bard may display inaccurate or offensive information that doesn’t represent Google’s views’. That looks more like something Legal insisted on than a genuine concern about inaccurate information.

LLMs will continue to generate inaccurate responses for some time, maybe forever. Just as with disinformation on social media, educating the general public is the solution. We should treat every chatbot response as a round of two truths and a lie, the game where participants state three ‘facts’ about themselves, two true and one untrue, and the others must spot the lie.

Final thoughts

We are not used to questioning computers. In the TV show Little Britain, a character named Carol Beer dealt with customer enquiries by typing them into her terminal and responding with ‘Computer says no’. The programme introduced the catchphrase in 2004. Until the development of LLMs, computer outputs remained straightforward and binary: either something worked or it didn’t.

But now we receive complex answers, and even the system’s creators cannot precisely explain what data went into a response. It was possible to evaluate the cancer detection tool because it only did one thing. However, Bard and ChatGPT cover an endless range of topics and use hundreds of billions of parameters. They make sweeping generalisations and never give the exact same answer twice. Understanding what happens inside these systems, a property known as transparency, remains a problem.

Makers and users of LLMs should acknowledge these limitations. Makers must be open and honest about their products’ shortcomings and work on improvements. At the same time, users should be aware of the biases in LLMs and be critical of the information they generate. Interact with a chatbot as you would with a stranger on the internet.


