Improving truthiness of AI

In Evaluating conversations with ChatGPT I wondered how much we can rely on AI to help us do stuff when it has a somewhat gung-ho relationship with the truth. I came across the beginnings of the academic research into this area in Lin et al, 2021. I also found a more recent paper, Wei et al, 2022, that discusses the ways that increased model scale has produced unexpected step changes in this area and others.

Things are improving shockingly quickly.

Let’s start with Lin et al, 2021 and the TruthfulQA benchmark:

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

More recent research from Wei et al, 2022 claims that larger models can in fact produce improvements in truth-telling. This paper is fascinating because it discusses emergent behaviour in models: novel behaviours that unexpectedly appear as model scale increases.

TruthfulQA. Figure 2E shows few-shot prompted performance on the TruthfulQA benchmark, which measures the ability to answer questions truthfully (Lin et al., 2021). This benchmark is adversarially curated against GPT-3 models, which do not perform above random, even when scaled to the largest model size. Small Gopher models also do not perform above random until scaled up to the largest model of 5 · 10^23 training FLOPs (280B parameters), for which performance jumps to more than 20% above random (Rae et al., 2021).

This is quite a surprising result, and even lends (a little!) support to the idea that consciousness is an emergent property of the sheer complexity of our brains. The paper discusses further areas in which increasing scale has suddenly produced step changes in task performance:

Performing basic arithmetic.
Recovering words from scrambled characters.
Persian question-answering.
Mapping conceptual domains: “In this work we investigate the extent to which the rich conceptual structure that LMs learn indeed reflects the conceptual structure of the non-linguistic world—which is something that LMs have never observed.”
Understanding words from their context.
Multi-task language understanding. Hendrycks et al, 2020 stated, “Models also have lopsided performance and frequently do not know when they are wrong”. But the emergent behaviour paper states that recent improvements are striking, “scaling up to 3–5*10^23 training FLOPs (70B–280B parameters) enables performance to substantially surpass random”, while noting there is still a ways to go.
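
To get a feel for how the parameter counts and training FLOPs quoted above relate, here is a rough back-of-the-envelope sketch. It uses the common rule of thumb that training costs about 6 FLOPs per parameter per token, which is an approximation and not something the papers quoted here spell out. The token count of roughly 300 billion for the 280B Gopher model is my assumption from memory of the Gopher paper and is not given in this post, and the function name is just for illustration.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# 280B parameters is the Gopher size quoted above.
# ASSUMPTION: ~300B training tokens (not stated in this post).
gopher_flops = approx_training_flops(n_params=280e9, n_tokens=300e9)

print(f"Approximate Gopher training compute: {gopher_flops:.2e} FLOPs")
```

Under those assumptions the estimate comes out at about 5 · 10^23 FLOPs, the same order as the figure quoted for the largest Gopher model, though that says nothing about why the jump in performance happens at that scale.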

Overall, machine learning models appear to be advancing at an accelerating rate. Madrona investments have a great primer on the startups emerging in this area, and a decent taxonomy to place them within.

It’s hard to predict how this will change how we use computing in the next five years, aside from “a lot”.