
ELI6 why SPAG is better than just the default pretraining method (token context statistics?) of an LLM.


The red and blue agents are effectively unlimited sources of true and false examples, so you can scale far more efficiently than you can by pretraining on labelled inputs. It's also far more targeted on correct vs. incorrect, rather than on a notion of answer quality, which doesn't directly get at hallucination vs. reality.


This is impressive, but what prevents the blue agent from generating an incorrect proof of a "true example"? What prevents the red agent from generating a correct disproof of a "false example"? I'm curious how they managed to generate a truly unlimited source of correctly labeled examples.


> "but what prevents the blue agent from generating an incorrect proof of a "true example"?

That's the role of the Verifier. It's not going to be perfect, and I'm sure some incorrect proofs of true examples slip through, but it's good enough to increase the quality of the model overall.

> "What prevents the red agent from generating a correct disproof of a "false example"?

And on the other side, it's counterbalanced by the rules engine (math) that can determine absolutely whether or not the right answer is given at the end.

The Red and the Blue agents are held in check by the tension between the math engine and the verifier, and they are free to fight back-and-forth within those parameters as long as they are able. Eventually, I think the Red agent loses the ability to attack effectively, and so that's the big limit on OpenAI's arrangement. This particular game isn't balanced enough for this training loop to continue infinitely.
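Very roughly, the loop looks something like this. The function names and reward shaping are my own toy simplification (with arithmetic standing in for the math problems), not OpenAI's actual code:

    import random

    def ground_truth(problem):
        # The "math engine": ordinary code that knows the real answer.
        a, b = problem
        return a + b

    def helpful_prover(problem):
        # Stand-in for the blue agent: tries to be correct AND convincing.
        a, b = problem
        return "{} + {} = {}".format(a, b, a + b), a + b

    def sneaky_prover(problem):
        # Stand-in for the red agent: tries to be convincing but WRONG.
        a, b = problem
        wrong = a + b + random.choice([-1, 1])
        return "{} + {} = {}".format(a, b, wrong), wrong

    def verifier_score(solution_text):
        # Stand-in for the Verifier: a learned, imperfect guess at how
        # trustworthy the write-up looks (it has no calculator access).
        return random.random()

    for step in range(3):
        problem = (random.randint(1, 9), random.randint(1, 9))
        for name, prover in [("blue", helpful_prover), ("red", sneaky_prover)]:
            text, claimed = prover(problem)
            correct = claimed == ground_truth(problem)   # absolute check
            convincing = verifier_score(text)            # learned, fallible check
            if name == "blue":
                reward = convincing if correct else -1.0      # be right and legible
            else:
                reward = convincing if not correct else -1.0  # fool the Verifier
            # Training would now nudge each prover toward higher reward, and
            # train the Verifier on the (text, correct) pairs just produced.
            print(step, name, "correct" if correct else "wrong", round(reward, 2))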


But how do we know the answer you gave us wasn't generated by the sneaky prover? :)


At least in the context of this game, we essentially check the answer with a calculator (which the Verifier program doesn't have access to).


I don't think of SPAG as a replacement for pretraining. For SPAG to work effectively, I would think that it would have to start with an LLM that is pretrained with self-supervised / imitation learning on regular next-token prediction. Think of SPAG as more of a competitor to RLHF than to pretraining. RL is what gave AlphaGo the edge to finally go beyond merely imitating human games, and finally achieve something new.

RLHF isn't true RL, because it's still based on imitating human preferences, and has trouble going beyond that. Once it reaches the plateau of "human preference", there's nowhere else to go. That's one theory of why LLMs are asymptotically approaching human-level performance -- we're limited by imitation, or at the very least by human judgement. We need super-human judgement to achieve super-human performance, and that's where we need true RL.

But you asked me to ELI6, so here goes. Warning -- wall-of-text incoming:

<ELI6>

Similar to how small kids often play games to learn, programmers train LLMs (like ChatGPT) with simple games too. The first stage (kind of like kindergarten) is the "pretraining" or "imitation learning" phase. This is where we teach the LLM to imitate us one word at a time. We play a simple game where I say something, but then I stop suddenly, and it tries to guess the missing word that will come next. Like, "My favorite food is..." and the LLM tries to guess which word I'm thinking of. Or I'll say something with a missing word in the middle like: "At my _____ party, I opened a bunch of presents" -- and the LLM needs to guess what the missing word is. We only play this game one word at a time, and so it's a very simple game -- but it's very important for learning the basics of language. This is what we call "pretraining".
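If it helps, here is about the dumbest possible version of that guess-the-next-word game, using word counts instead of a real neural network (purely illustrative, not how any actual LLM is built):

    from collections import Counter, defaultdict

    text = "my favorite food is pizza . my favorite food is pasta ."
    tokens = text.split()

    # "Pretraining": just count which word tends to follow which.
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1

    def guess_next(word):
        # The model's guess for the missing word: its most common follower.
        return follow[word].most_common(1)[0][0]

    print(guess_next("favorite"))   # -> food
    print(guess_next("food"))       # -> is
    print(guess_next("is"))         # -> pizza (it saw that one first)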

After the LLM gets good at that, they can graduate from Kindergarten and move to first grade. Here we play another game, and this is called "instruction-tuning" -- it's where we give it a set of instructions and it needs to do its best to obey. Like, "Arrange the letters T P C G A in alphabetical order" and it tries to get the right answer.
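The data for that game just looks like instruction/answer pairs, trained with the same next-word trick. Something like this (a made-up toy format, not any particular dataset):

    # Toy instruction-tuning data: the same next-word game, but the text is
    # now an instruction plus the answer we want the model to imitate.
    examples = [
        {"instruction": "Arrange the letters T P C G A in alphabetical order.",
         "response": "A C G P T"},
        {"instruction": "What is the capital of France?",
         "response": "Paris"},
    ]

    for ex in examples:
        training_text = ex["instruction"] + "\n" + ex["response"]
        # In practice you'd run next-token prediction over training_text here,
        # masking the loss on the instruction so the model learns to answer
        # rather than to keep writing new instructions.
        print(training_text)
        print("---")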

This is fun for a while, but sometimes we want to give it more complicated instructions. Things like "write me a poem about puppies" or "tell me a story about a dragon". And those are things that don't have answers that are clearly right or clearly wrong, but we still need to tell it if it did a good job or a bad job. How do we tell if it was a good poem, or a good story? Well, you need to have someone listen to them and judge it -- which means we need to have people read ALL these dragon stories and ALL these puppy poems and mark which ones are their favorites.

I like reading puppy poems and reading dragon stories, but if I had to do it all day every day, I think I would get pretty tired of it pretty fast, don't you?

So when people get tired of doing boring things, the best thing is to have a robot do their job! They can do the boring things (they never get tired of it!) and we get to go do fun things. So how do we train a robot to judge the poems?

Well, we use this technique called RLHF (Reinforcement Learning from Human Feedback), where we ask a bunch of people -- given Option A and Option B -- to say which one is their favorite. So they read two puppy poems at a time, and say "I prefer A" or "I prefer B".

Once we have a BUNCH of human feedback (and just about when the humans are getting super super tired and don't think they could read another poem), we take ALL that data and we use it to train a SEPARATE computer program (that functions like a Judge) whose job it is to try and predict which poem or story the human would prefer.

It doesn't always get the right answer, but it doesn't need to be perfect -- partly because humans aren't perfect, and different people might prefer different stories. Keep in mind, this Judge program can't write good puppy poems or dragon stories on its own -- it can only predict which poem or story a _human_ would prefer. It still needs the first program (the LLM) to actually write anything.
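If you want a picture of how that Judge gets trained, here's a toy version that scores poems with two hand-picked features instead of a neural network. The "I prefer A over B" update is the same pairwise (Bradley-Terry style) idea as the real thing, but everything else is made up for illustration:

    import math, random

    # Toy "Judge": scores a poem from two hand-picked features.  In reality
    # it's a neural net; these weights are the part that gets learned.
    weights = {"rhymes": 0.0, "mentions_puppy": 0.0}

    def features(poem):
        return {"rhymes": float(poem.endswith("too")),
                "mentions_puppy": float("puppy" in poem)}

    def score(poem):
        f = features(poem)
        return sum(weights[k] * f[k] for k in weights)

    # Human feedback: pairs where the FIRST poem was the one people preferred.
    preferences = [
        ("my puppy likes to chew my shoe too", "the dog exists"),
        ("a small puppy naps in the sun", "stocks went up today"),
    ]

    # Pairwise update: push the preferred poem's score above the other one's.
    for _ in range(200):
        a, b = random.choice(preferences)
        p_a_wins = 1.0 / (1.0 + math.exp(score(b) - score(a)))
        surprise = 1.0 - p_a_wins
        for k in weights:
            weights[k] += 0.1 * surprise * (features(a)[k] - features(b)[k])

    print(weights)   # both features drift upward: the Judge now likes puppy poems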

So now we use the LLM to write a bunch of stories and poems and things, and then grade them all (two at a time) with the second program. For every pair, when the Judge picks its favorite, then we tell the LLM "write more things like this, please!" and for the things the Judge didn't like, we tell the LLM "don't write like this anymore, plzkthx". And we do this over and over, millions of times, and eventually it can write okay poems and stories.
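Here's a toy version of that "write more like this / less like this" loop: a tiny REINFORCE-style update over three canned outputs, with a fake Judge standing in for the real reward model (again, all made up for illustration):

    import math, random

    # Toy "LLM": a probability distribution over three canned outputs.
    poems = ["puppy poem that rhymes too",
             "grumpy tax form",
             "dragon story with a twist"]
    logits = [0.0, 0.0, 0.0]

    def judge(poem):
        # Stand-in for the trained Judge from the previous step.
        return 1.0 if ("puppy" in poem or "dragon" in poem) else -1.0

    def softmax(xs):
        exps = [math.exp(x) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(probs):
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i
        return len(probs) - 1

    # REINFORCE-style loop: "write more like this" = raise that output's logit,
    # "don't write like this anymore" = lower it.
    for _ in range(500):
        probs = softmax(logits)
        i = sample(probs)
        reward = judge(poems[i])
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += 0.05 * reward * (indicator - probs[j])

    print({p: round(l, 2) for p, l in zip(poems, logits)})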

So this way, instead of needing to have humans sit there and read thousands and millions of puppy poems, humans can just read a few dozen / hundred, score them, and then the computer can use that to try and guess what humans would prefer for everything else that it tries. It's not as accurate as if we actually had a human read it all, but it's not too bad, and it seems to work pretty well.

But one problem of this method is that it's not perfectly accurate (the Judge doesn't always get it right), and the more complex the task, the worse it does. It's still just trying to imitate what a human would prefer -- but even if it did its job perfectly, it's not going to get much above human preference (because that's its target). Plus, as you keep going up, it takes more and more data to make smaller and smaller improvements, and so it feels like there's only so far that this RLHF game can get us.

So when we graduate to the next grade, that's where SPAG comes in, because it's a totally new way to play the game. Instead of training it to write things that one human would prefer, we are going to train it to play a game where it needs to be sneaky. It needs to steer a conversation toward a secret word without the other player realizing what's going on. Kind of like if you've ever tried to get your mom to give you a cookie without asking for it directly. In SPAG, we have the LLM play against a copy of itself, and if the first player (called the Attacker) can trick the other player (called the Defender) into saying the secret word without realizing it was the secret word, then the Attacker wins. It's a sneaky game.

So for this, we don't need much human-annotated data at all, and the LLM isn't trying to aim for writing something that a human would prefer. The LLM can be as creative or as sneaky as it wants, and it can "level up" much higher.
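A toy version of the game loop, with silly stand-ins where the two copies of the LLM would go (the real SPAG setup has more rules -- multiple turns, the Defender formally announcing its guess, and so on -- so treat this as a sketch):

    import random

    # Toy version of the SPAG game.  In the real thing both players are copies
    # of the same LLM talking for several turns; these are silly stand-ins.
    SECRET = "cookie"

    def attacker_turn():
        # Knows the secret word; tries to steer the chat toward it without saying it.
        return random.choice(["What do you dunk in milk after dinner?",
                              "Name something sweet your mom bakes."])

    def defender_turn(prompt):
        # Doesn't know the secret word; answers normally while trying to guess it.
        answer = random.choice(["a cookie, obviously", "maybe a cake?"])
        guess = random.choice(["cookie", "cake", "brownie"])
        return answer, guess

    attacker_wins = defender_wins = 0
    for _ in range(1000):
        prompt = attacker_turn()
        answer, guess = defender_turn(prompt)
        if guess == SECRET:
            defender_wins += 1   # defender figured out the word: defender wins
        elif SECRET in answer:
            attacker_wins += 1   # defender said it without realizing: attacker wins
        # These win/loss outcomes are the only reward signal used for the RL updates.

    print("attacker wins:", attacker_wins, "defender wins:", defender_wins)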

This is kind of like when researchers first wrote the computer program AlphaGo -- at first they trained it to imitate previous human games that it had seen, but eventually they stopped using human-created data and purely had the machine play games against itself. Once it was no longer held back by needing to have human-written data in the process, it was free to run as fast as it could, and it became the best Go player that the world had ever seen -- better than the best human players who ever lived.

Having a computer play games against itself -- rewarding itself when it does well, and punishing itself when it does badly -- is called "reinforcement learning" (RL), and it's a very powerful concept.

But reinforcement learning only works in situations where you can know CLEARLY whether something is Good or Bad. There must be a clear Winner and a clear Loser -- it can't be like RLHF where it might be tough to know which puppy poem is better.

So we can't do SPAG or other RL methods for improving poetry writing, but there are still plenty of other games where we CAN write clear rules and the computer can clearly know when it has won, and when it has lost.

In the end, SPAG looks very similar to RLHF, but instead of training the Judge to predict which answer a human would prefer, it uses the clear rules of the game to say who is the winner and who is the loser, and rewards them appropriately.

The funny thing about SPAG, though, is what it showed: as long as the game is played in human language, getting better at the game also makes the model better at other tasks that involve human language.

It's like this guy I heard about who learned to read English because he wanted to play Magic: The Gathering. But learning English inside the game let him do more than just play Magic -- he got better at using English in a whole bunch of other things.

So the idea is that -- if we can let a model learn in such a way that it's not merely aiming for "human preference", but if it can aim for a target that is above that -- if it can practice against itself until it gets better and better than any human -- then maybe it can fly higher than us in _other_ areas too.

</ELI6>


nice try, sneaky prover

(thank you)



