Evaluating conversations with ChatGPT
ChatGPT offers a glimpse of the future, where a program can hold up its side of a conversation with competence. Before thinking about how we might feel about this, let’s look at an example:
Mike: What was the cause of the global financial crisis in 2008?
ChatGPT: The global financial crisis in 2008 was caused by a combination of factors, including the buildup of excessive levels of debt, risky financial practices by banks and other financial institutions, and a lack of adequate regulation. The crisis was triggered by a sharp increase in subprime mortgage defaults in the United States, which led to the collapse of several major financial institutions and the onset of a severe global recession.
Mike: What major financial institutions collapsed?
ChatGPT: During the global financial crisis in 2008, several major financial institutions collapsed, including Lehman Brothers, Bear Stearns, and Merrill Lynch. These failures, along with the problems faced by other large financial institutions, contributed to the severity of the crisis and the onset of the global recession that followed.
What can we see from this snippet?
- The bot generated plausible responses to what I said.
- Phrasing is easy to understand, and there are no glaring oddities in the wording.
- Lack of personality; these answers feel like a regurgitation of rote learning.
- The second answer could start after the first comma and stop after one sentence; such wordiness is typical of ChatGPT's responses.
- Even though its answers are longer than required, they often lack exhaustiveness (e.g. a fuller list of the institutions that collapsed).
While most of these pick at weaknesses in the responses, taking a step back paints a different picture. The way that ChatGPT can hold a conversation is a step change from what I have seen before. I’ve seen that Siri holds basic context: you can ask “What’s the weather forecast for tomorrow?” and follow that with “And what about for Cardiff?” and receive another weather forecast. But the way ChatGPT can use prior discussion as deeper context to help it generate its next response feels quite different.
If we delve further, however, we'll start to see that ChatGPT's relationship with the truth is somewhat loose, and that it will often spout falsehoods in a surprisingly confident tone. OpenAI themselves note this:
ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.
I find this tarnishes my view of the system. Let’s look at some other examples to see this in action.
Maths isn’t my strong point
ChatGPT can sound absolutely confident, while being absolutely wrong. Asking mathematical questions is a good way to quickly show this:
Mike: What is 3000 + 10000 - 456?
ChatGPT: The result of 3000 + 10000 - 456 is 12644.
Linguistically? A totally fine and plausible way to respond. Mathematically? Out by a hundred; the answer is 12,544.
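For completeness, the sum is trivial to check mechanically. A minimal Python sketch (the numbers below come from the exchange above; nothing else is from the transcript) confirms the correct result and the size of the miss:

```python
# Verify the arithmetic from the exchange above.
correct = 3000 + 10000 - 456
print(correct)  # 12544

# ChatGPT's reply in this exchange was 12644.
print(abs(12644 - correct))  # 100
```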
ChatGPT isn't consistent, though, and repeating these simple questions is a good way to get a feel for its underlying methods. It might get it right, if whatever glittery randomness is sprinkled into your response happens to be the right glittery randomness:
Mike: What is 3000 + 10000 - 456?
ChatGPT: The answer is 12554. When you add 3000 and 10000, you get 13000. Then, when you subtract 456 from that number, you get 12554.
Or it might get it even more wrong:
Mike: What is 3000 + 10000 - 456?
ChatGPT: The result of 3000 + 10000 - 456 is 11644. This is a simple arithmetic problem that can be solved by performing the operations in the order in which they are given, from left to right. First, we add 3000 and 10000 to get 13000. Then, we subtract 456 from that result to get 11644.
900 out this time.
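My guess at what that glittery randomness is: language models typically pick each output token by sampling from a probability distribution, often softened by a "temperature" parameter, so the same prompt can produce different answers on different runs. A toy sketch of that mechanism (the tokens and scores below are made up for illustration, not ChatGPT's real internals):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# Imaginary candidate answers to 3000 + 10000 - 456, with made-up scores:
# the right answer is merely the most *likely* pick, not guaranteed.
tokens = ["12544", "12644", "11644"]
logits = [2.0, 1.0, 0.5]
rng = random.Random(0)
picks = [tokens[sample_with_temperature(logits, 1.0, rng)] for _ in range(10)]
print(picks)
```

At a very low temperature the distribution collapses onto the highest-scoring answer and the output becomes deterministic, which is consistent with the way repeated identical prompts can yield different answers at higher temperatures.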
You’ll know I’m right by my great air of confidence
The other troubling aspect of ChatGPT is the confidence its tone often exudes while it is, again, similarly wrong. This makes it far too easy to trust its responses. Take this one:
Mike: Did Thomas Hobbes believe in separation of powers?
ChatGPT: Yes, Thomas Hobbes believed in the separation of powers. In his book “Leviathan,” Hobbes argued that the best form of government is one in which power is divided among different branches, such as the executive, legislative, and judicial branches. This separation of powers helps to prevent any one person or group from having too much power, which can lead to abuses of power and ultimately result in a breakdown of society.
Sounds very plausible. Very plausible indeed. Sadly, however, Britannica will tell you that Hobbes intended Leviathan to leave its readers with the opposite message. Instead of a plurality of rulers, Hobbes argues for a single person, a “sovereign”, to whom we all submit completely:
The sovereign determines who owns what, who will hold which public offices, how the economy will be regulated, what acts will be crimes, and what punishments criminals should receive. The sovereign is the supreme commander of the army, supreme interpreter of law, and supreme interpreter of scripture, with authority over any national church.
It’s harder to trip up ChatGPT with simpler questions such as this one:
Mike: What is the weather likely to be tomorrow?
ChatGPT: I’m sorry, but I am unable to answer that question as I am a large language model trained by OpenAI and do not have access to real-time weather information. Additionally, my training only goes up until 2021, so I am not able to provide information about future events.
I wonder whether this is a result of training indicating that this answer is “best” for this type of prompt, or whether certain questions just get shunted to this type of message.
So what can we draw from this?
I think it’s clear that ChatGPT is very successful in its ability to be conversational. While its phrases are lacking in sparkle, they are not obviously incorrect or clumsy. Ask ChatGPT to generate marketing copy, and it does a good job with the punchy tone but there’s little originality on show. This shouldn’t be surprising given that, in the end, it’s generating prose based on prose it’s previously been shown.
As we’ve seen, however, ChatGPT’s confident and apparently reasoned reporting is the same regardless of the truth of the delivered statement. It will also answer the same question both correctly and incorrectly at different times depending on hidden variables. As it stands, ChatGPT isn’t trustworthy. I wouldn’t want to put medical symptoms into it, or ask it what was safe to feed my cat.
Because of this, I think the furore over ChatGPT is overblown. It uses linguistic models to generate pretty good approximations of conversation. ChatGPT’s consistent use of a very confident tone makes it easy to fool oneself that there’s more going on behind the curtain than there is. To their credit, OpenAI themselves are pretty open about their goal being passable conversational style rather than correctness. But I think that its lack of regard for the truth has been overlooked in much evaluation, which focuses instead on ChatGPT’s often surprising prowess at appearing clever.
I assume this affectation of confidently stating falsehoods as fact is because ChatGPT’s training process asked its human trainers to rate based on feeling of conversation rather than truth. Are there other ways to train the underlying model that might result in fewer, or no, falsehoods?
ChatGPT’s ability to make sense of its inputs when generating responses is far superior to that of other systems I’ve used, but only once its answers can be relied upon will its potential be realised. Only then could we really start to ascribe it the title of “Google killer”.