“create a song in spongebob style” will be cut into tokens, which are roughly syllables (out of 50257 possible tokens), and each token is converted to a list of 12288 numbers. Each token always maps to the same list, called its embedding; the conversion table is called the token embedding matrix. Two embeddings that lie a short distance apart correspond to similar concepts.
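Here is a minimal sketch of that lookup in NumPy, with a toy embedding size and made-up token ids (the real tokenizer and the learned matrix are not reproduced here):

```python
import numpy as np

VOCAB_SIZE = 50257   # number of possible tokens
EMBED_SIZE = 16      # toy size; GPT-3 uses 12288

rng = np.random.default_rng(0)
# The token embedding matrix: one row per possible token.
# (Random here; in the real model these values are learned.)
token_embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBED_SIZE)).astype(np.float32)

# Hypothetical token ids for "create a song in spongebob style".
token_ids = [9421, 257, 3496, 287, 599, 261, 469, 65, 672, 3918]

# Looking up a token id always returns the same row: its embedding.
embeddings = token_embedding_matrix[token_ids]   # shape: (10, 16)
print(embeddings.shape)
```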
Then each token’s embedding is roughly multiplied with a set of matrices called an “attention head”, which yields three lists: query, key, value, each of 128 numbers behaving somewhat like a fragment of an embedding. We then take each token’s query list and multiply it with the key list of each of the past 2048 tokens; each such product is a single number indicating how much one token influences another. Each token’s value list gets multiplied by that influence and the results are summed, so that the output for a token (a fragment of an embedding, as a list of 128 numbers) is roughly a mix of the value lists of the tokens that influence it most.
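A minimal sketch of one attention head, with random weights and a window of 5 tokens standing in for the real 2048; the causal mask (a token may only look at itself and earlier tokens) and the usual 1/√d scaling are included so the numbers behave sensibly:

```python
import numpy as np

EMBED_SIZE = 12288
HEAD_SIZE = 128
WINDOW = 5   # stand-in for the real 2048-token window

rng = np.random.default_rng(0)
x = rng.normal(size=(WINDOW, EMBED_SIZE)).astype(np.float32)  # token embeddings

# The three matrices that make up one attention head (learned in the real model).
W_q = rng.normal(size=(EMBED_SIZE, HEAD_SIZE)).astype(np.float32)
W_k = rng.normal(size=(EMBED_SIZE, HEAD_SIZE)).astype(np.float32)
W_v = rng.normal(size=(EMBED_SIZE, HEAD_SIZE)).astype(np.float32)

q = x @ W_q   # query list for each token, 128 numbers each
k = x @ W_k   # key list for each token
v = x @ W_v   # value list for each token

# How much each token influences each other token: query . key.
scores = q @ k.T / np.sqrt(HEAD_SIZE)                 # shape: (5, 5)
# Causal mask: a token cannot be influenced by later tokens.
mask = np.tril(np.ones((WINDOW, WINDOW), dtype=bool))
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1

# Each token's output is a mix of value lists, weighted by influence.
head_output = weights @ v                             # shape: (5, 128)
print(head_output.shape)
```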
We compute 96 attention heads in parallel, so that we get 128×96 = 12288 numbers, which is the size of the embedding we had at the start. We then multiply each of those 12288 numbers by a weight, sum the results, and pass the sum through a nonlinear function; we do this 49152 times, with different weights each time. Then we do the same again with other weights, but only 12288 times, so that we obtain 12288 numbers, which is what we started with. This is the feedforward layer. Thanks to it, each fragment of a token’s embedding is modified by the other fragments of that token’s embedding.
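A minimal sketch of that feedforward layer for a single token, at toy sizes (12 and 48 in place of GPT-3’s 12288 and 49152) and assuming GELU as the nonlinear function, which is what GPT-3 uses:

```python
import numpy as np

EMBED_SIZE = 12     # toy size; GPT-3 uses 12288
HIDDEN_SIZE = 48    # toy size; GPT-3 uses 49152 (4 x 12288)

rng = np.random.default_rng(0)
x = rng.normal(size=(EMBED_SIZE,)).astype(np.float32)  # one token's embedding

W1 = rng.normal(size=(EMBED_SIZE, HIDDEN_SIZE)).astype(np.float32) * 0.1
b1 = np.zeros(HIDDEN_SIZE, dtype=np.float32)
W2 = rng.normal(size=(HIDDEN_SIZE, EMBED_SIZE)).astype(np.float32) * 0.1
b2 = np.zeros(EMBED_SIZE, dtype=np.float32)

def gelu(z):
    # Smooth nonlinearity used in GPT-3's feedforward layers.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

hidden = gelu(x @ W1 + b1)   # 48 weighted sums, each passed through the nonlinearity
output = hidden @ W2 + b2    # back down to the embedding size
print(output.shape)          # same size as the input embedding
```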
Then we pass that output (a window of 2048 token embeddings, each of 12288 numbers) through another multi-head attention step, then another feedforward layer, again. And again. And again. 96 times in total.
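The overall loop is nothing more than this, sketched here with do-nothing stand-ins for the two blocks above (identity functions here, learned computations in the real model); the point is that the shape never changes, so the blocks can be stacked indefinitely:

```python
import numpy as np

NUM_LAYERS = 96
WINDOW, EMBED_SIZE = 4, 8   # toy stand-ins for GPT-3's 2048 and 12288

def attention_block(x):
    return x    # placeholder for the multi-head attention sketched above

def feedforward_block(x):
    return x    # placeholder for the feedforward layer sketched above

x = np.zeros((WINDOW, EMBED_SIZE), dtype=np.float32)
for _ in range(NUM_LAYERS):
    x = attention_block(x)    # tokens exchange information
    x = feedforward_block(x)  # each token's embedding is reworked on its own
print(x.shape)                # still (WINDOW, EMBED_SIZE)
```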
Then we convert the output to a set of 50257 numbers (one for each possible next token) that give the probability of each token being the next syllable.
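A minimal sketch of that conversion, assuming the standard recipe of a linear projection followed by a softmax; the projection matrix is random here, and a toy embedding size stands in for 12288:

```python
import numpy as np

VOCAB_SIZE = 50257
EMBED_SIZE = 32    # toy size; GPT-3 uses 12288

rng = np.random.default_rng(0)
last_token_embedding = rng.normal(size=(EMBED_SIZE,)).astype(np.float32)
unembedding = rng.normal(size=(EMBED_SIZE, VOCAB_SIZE)).astype(np.float32)

logits = last_token_embedding @ unembedding   # 50257 raw scores, one per possible next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: probabilities that sum to 1
print(probs.shape, float(probs.sum()))        # (50257,) 1.0
```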
The token embedding matrix, multi-head attention weights, feedforward weights, etc. have been learned by computing the gradient of the cross-entropy loss (i.e. roughly how unlikely the model considered the actual next syllable) of the model’s output, with respect to each weight in the model, and nudging the weights towards lower loss.
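A minimal sketch of one such training step on a toy “model” (a single matrix standing in for the whole network): compute the cross-entropy of the predicted next-token distribution, take its gradient with respect to every weight, and nudge the weights a little in the direction that lowers the loss:

```python
import numpy as np

VOCAB_SIZE, EMBED_SIZE, LEARNING_RATE = 50257, 32, 0.01
rng = np.random.default_rng(0)

W = rng.normal(size=(EMBED_SIZE, VOCAB_SIZE)).astype(np.float32) * 0.01  # all the "weights"
x = rng.normal(size=(EMBED_SIZE,)).astype(np.float32)   # some token's representation
target = 3496                                            # id of the actual next token

# Forward pass: scores -> probabilities -> cross-entropy loss.
logits = x @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target])   # small when the model gives the true next token a high probability

# Backward pass: gradient of the loss with respect to every weight.
dlogits = probs.copy()
dlogits[target] -= 1.0          # gradient of softmax + cross-entropy w.r.t. the logits
dW = np.outer(x, dlogits)       # gradient w.r.t. the weights

# Nudge the weights towards lower loss.
W -= LEARNING_RATE * dW
print(float(loss))
```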
So really, it works because some part of the embedding space knows that a song is lyrical, some part of an attention head knows that “sponge” and “bob” together refer to a particular show, some part of a feedforward layer knows that this show sits near “underwater” in the embedding space, and so on.