“Hello, I am Featureiman Byeswickattribute argue”
Thus concludes chapter 4 of Build a Large Language Model (from Scratch). Coding along with the book’s examples, I now have an untrained version of GPT-2, first released in 2019. When fed with “Hello, I am” as a prompt, the untrained model outputs gibberish. This post’s title is taken from that gibberish.
Next comes Chapter 5, which will cover the training that will take us from gibberish to intelligible text. But for this post, I wanted to take the time to capture my thoughts at this point in the book.
Rather than explaining concepts that others have covered better, I’ll share my stream of consciousness about how fascinating and weird it is that this stuff works at all.
That’s a hella lot of numbers
The GPT-2 Small model we have contains 163,009,536 trainable numbers (parameters). GPT-2 XL has 1,637,792,000 (1.6 billion) of them. Current frontier models have many billions more.
Writing the code has helped me grasp just how many trainable numbers (parameters) are in play here. Having to type the code out means spending time in the company of this immense number of numbers, which is a lot more time than just skimming over a figure on a page.
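As a sanity check on that figure, here's a rough back-of-envelope count for GPT-2 Small, assuming the configuration used in the book (no weight tying, and no biases on the query/key/value projections):

```python
# Rough parameter count for GPT-2 Small, assuming the book's configuration
# (no weight tying, no biases on the query/key/value projections).
vocab_size, context_len, emb_dim, n_layers = 50_257, 1_024, 768, 12
ff_dim = 4 * emb_dim

tok_emb = vocab_size * emb_dim                    # token embedding table
pos_emb = context_len * emb_dim                   # positional embedding table

attn = 3 * emb_dim * emb_dim                      # W_query, W_key, W_value
attn += emb_dim * emb_dim + emb_dim               # output projection (+ bias)
ff = (emb_dim * ff_dim + ff_dim) + (ff_dim * emb_dim + emb_dim)  # two linear layers
norms = 2 * 2 * emb_dim                           # two LayerNorms (scale + shift)
per_block = attn + ff + norms

final_norm = 2 * emb_dim
out_head = emb_dim * vocab_size                   # projection back onto the vocabulary

total = tok_emb + pos_emb + n_layers * per_block + final_norm + out_head
print(f"{total:,}")                               # 163,009,536
```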
This vast scale provides a lot of opportunity to store patterns, and language is nothing if not full of patterns. The models have billions of numbers to play with, each individually tuned to contribute in its own small but particular way.
The magic of embeddings
The ability to encode meaning from text in such a way that similar things are “near to” each other in a mathematically described space is what allows the transformer blocks to do meaningful work. The translation from abstract concept to concrete mathematical construct feels like magic.
Seeing the complete GPT-2 architecture builds the intuition that making embedding generation part of the model itself gives the model “more space” to develop meaning and reasoning capabilities. The embeddings can “work with”, and be “tuned for”, the overall architecture of the model. It’s not just a pre-processing step; it’s integral to how the model understands language.
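Concretely, in the book's code the token embedding lookup is just another trainable layer inside the model (a tiny sketch; variable names are mine):

```python
import torch.nn as nn

vocab_size, emb_dim = 50_257, 768

# The embedding lookup is an ordinary trainable layer inside the model,
# so its weights get updated by the same backpropagation as everything else.
tok_emb = nn.Embedding(vocab_size, emb_dim)
print(tok_emb.weight.requires_grad)  # True
```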
I still can’t quite believe that vectors of numbers can capture meanings so fluidly—that “robin,” “sparrow,” and “bluetit” will be close to each other, near concepts like “wing” and “beak,” within the larger category of “birds.” But “wing” also needs to be close to “manta ray” and “beak” to “parrot fish”.
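A toy illustration of “near to each other”, using cosine similarity over made-up four-dimensional vectors (the numbers are invented for illustration; real GPT-2 embeddings have hundreds of dimensions):

```python
import torch

# Made-up 4-dimensional "embeddings", purely for illustration.
vecs = {
    "robin":   torch.tensor([0.9, 0.8, 0.1, 0.0]),
    "sparrow": torch.tensor([0.8, 0.9, 0.2, 0.1]),
    "spanner": torch.tensor([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return torch.dot(a, b) / (a.norm() * b.norm())

print(cosine(vecs["robin"], vecs["sparrow"]))  # high: related concepts sit close together
print(cosine(vecs["robin"], vecs["spanner"]))  # low: unrelated concepts sit far apart
```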
Without this magic, none of the other stuff would be possible.
Attention
Attention relies on word embedding vectors because it uses dot products to identify the “most related” words. As the model creates the embeddings for words during its training, the embeddings become highly specific to the model, entwined with the other trainable weights within the “real” neural network layers.
Within the attention module, further trainable weight matrices (the query, key and value projection matrices) introduce another layer of indirection by allowing the model to manipulate the embedding vectors before taking dot products of them. All these extra layers of trainable values give the model many places to encode some form of processing and meaning.
Of course, because the transformer blocks are arranged in series, only the first block literally sees the raw embeddings as its input. But if that first block saw something meaningless, the entire model’s output would likely be nonsense, just like the untrained model’s “predictions” we saw above. Embeddings really are the cornerstone of these models.
The query and key weight matrices let the model learn transformations/projections of the raw embedding vectors that are better suited to understanding particular relations between words in a sentence (versus the abstract word meanings that embeddings capture). Once transformed, taking the dot product of the queries and keys produces a matrix in which high values represent… something that later proves useful in the next word prediction task.
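To make that concrete, here's a minimal sketch of the score computation, in the spirit of the book's simplified attention code (sizes and variable names are my own, and I've left out the causal mask and multiple heads):

```python
import torch

torch.manual_seed(123)

emb_dim, d_out = 8, 8                        # tiny sizes, for illustration only
inputs = torch.randn(5, emb_dim)             # five made-up token embeddings

# Trainable projections (randomly initialised here; learned during training)
W_query = torch.nn.Linear(emb_dim, d_out, bias=False)
W_key   = torch.nn.Linear(emb_dim, d_out, bias=False)
W_value = torch.nn.Linear(emb_dim, d_out, bias=False)

queries, keys, values = W_query(inputs), W_key(inputs), W_value(inputs)

# Dot product of every query with every key: a 5x5 matrix of "relatedness" scores
scores = queries @ keys.T
weights = torch.softmax(scores / d_out**0.5, dim=-1)  # scaled, then normalised per row
context = weights @ values                            # weighted mix of the value vectors
print(weights)                                        # each row sums to 1
```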
What that something is, is kind of a mystery. But good progress is being made towards understanding it. Anthropic’s recent paper, On the Biology of a Large Language Model, is a fascinating discussion of the matter.
So: every word is exploded into thousands of numbers, which are combined via dot products with other thousands of numbers that have been selected via exhaustive training. And then all those numbers are passed down the model to be further multiplied again, and again, and again… and somehow, useful text appears. Magic, I tell you.
Beyond next word prediction
Alternatively, does training for next word prediction only make models good at next word prediction?
My intuition is that the models are generalising beyond next word prediction when they update their weights during training. It feels like the next word prediction training task forces them to understand language more deeply, embedding concepts rather than merely word probabilities.
Again, On the Biology of a Large Language Model has useful things to say. Anthropic’s team found tantalising evidence that Claude Haiku was planning further ahead than the next word when creating (rather awful) poetry. Many of the findings in that paper suggest an internalisation of deeper concepts.
It’s like how improved reflexes from playing video games can be useful outside of gaming—next word prediction might be a good reward function for learning more general concepts.
Different ways to interact
A modality is a way of interacting with a model — text is a modality, as is audio.
Models have traditionally been fixed in their modality. While LLMs could talk about lots of different things, they could still only talk (in text, on screen). Other trained networks were able to do other things, such as recognise images or, like Whisper, transcribe audio to text. Previously, we’ve connected different models together to enable interaction in different modalities.
Of course, multi-modal models are now a reality — like GPT-4o. GPT-4o can accept input and generate output in text, image and audio. That covers vision and audio, but what about the other senses? Are they far behind? Perhaps not: Google are looking at smell already 😬.
OpenAI note that their previous “voice chat with GPT” used a pipeline, with text passed between each stage:

voice recognition -> GPT-4 -> voice synthesis
GPT-4o can do it all in one model, and is much faster for it.
A less obvious consequence is that GPT-4o is also able to use more information; as well as the words, speech carries information in tone of voice, prosody and more. GPT-4o can, in theory, make use of this in a way that the pipelined version could not.
The ability of a model to take in an image and output text about it isn’t new with GPT-4o, or with LLMs in general; it’s been around for a while. I think these models provide strong support for the notion that they embed much deeper meaning than just statistics about language.
Building things is helpful. Again.
Following the book’s guidance and creating the attention mechanism from scratch has let me spend much more time with the ideas behind neural networks than I ever have before. When I look at the vast sets of trainable numbers used inside LLMs, I feel drawn to parallels with physics.
Just as complex systems in physics can often be modelled with equations representing multi-dimensional curves or surfaces, LLMs’ sheer quantity of weights allows them to approximate extremely complicated functions. The non-linear activation layers after attention are key to this, enabling the model to learn curves rather than straight lines.
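For reference, this is roughly where that non-linearity lives: the feed-forward block that follows attention inside each transformer block (a sketch along the lines of the book's code, using GPT-2 Small's sizes):

```python
import torch.nn as nn

emb_dim = 768

# Feed-forward block that follows attention in each transformer block.
# Without the GELU in the middle, the two linear layers would collapse
# into a single linear map: straight lines only, no curves.
feed_forward = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim),
)
```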
Perhaps the next-word training task has tuned the weights to describe deeper patterns in the real world by approximating some kind of mathematical functions that resolve to those patterns. I’m way out beyond my knowledge horizon at this point, however.
In particular, how do long strings of numbers capture the essence of “sparrow” or “discombobulation”? I’m not sure if the math is beyond me or if we simply found something that works without fully understanding why. More reading is needed.
It makes me giddy.