
Interesting that they used Chain of Thought Prompting[1] for improved reasoning so soon after its publication. It's also related to DeepMind's AlphaCode, which generates code and filters the results by unit tests, whereas Chain of Thought Prompting filters by checking for the correct answer at the end.

Seems like language models can generate more training data for language models in an iterative manner.
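
As a rough sketch of what that generate-and-filter loop might look like (sample_completion here is a hypothetical stand-in for whatever sampling API you have, and the "Answer:" convention is an assumption rather than the papers' exact format):

    # Sample several chains of thought per question and keep only the ones
    # whose final answer matches the known label, so the kept transcripts
    # can be mixed back into the training data.

    def extract_final_answer(completion: str) -> str:
        # Assume each chain of thought ends with a line like "Answer: 42".
        for line in reversed(completion.strip().splitlines()):
            if line.startswith("Answer:"):
                return line[len("Answer:"):].strip()
        return ""

    def filter_chains(question: str, known_answer: str,
                      sample_completion, n_samples: int = 16):
        prompt = "Q: " + question + "\nA: Let's think step by step.\n"
        kept = []
        for _ in range(n_samples):
            completion = sample_completion(prompt)  # one sampled chain of thought
            if extract_final_answer(completion) == known_answer:
                kept.append(prompt + completion)    # keep as a new training example
        return kept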

[1] https://arxiv.org/abs/2201.11903



The chain-of-thought paper is from Google, so they've potentially known about it internally for a while.


The general technique is pretty obvious; I discussed and demonstrated it in some HN comments with GPT-2 and GPT-3 a couple of times over the last couple of years, and suggested some speculative extensions (which might be totally unworkable; unfortunately these networks are too big for me to train, so I can't try them out): https://news.ycombinator.com/item?id=24005638


In fact, people had already shown it working with GPT-3 before you wrote your comment: https://twitter.com/kleptid/status/1284069270603866113 and https://twitter.com/kleptid/status/1284098635689611264. Seeing how much smarter it could be with dialogue was very exciting back then, when people were still super-skeptical.

The follow-up work has also brought out a lot of interesting points: why didn't anyone get that working with GPT-2, and why wouldn't your GPT-2 suggestion have worked? Because inner-monologue capabilities seem to emerge only at some point past 100B parameters (and/or an equivalent level of compute), furnishing one of the most striking examples of emergent capability spikes in large NNs. GPT-2 is just way too small, and if you had tried, you would've concluded inner monologue doesn't work. It doesn't work, and it keeps on not working... until suddenly it does.


Is there any convincing research on how/why the inner-monologue capabilities emerge?

It's extremely unintuitive, but also pretty empirically obvious, that LLMs gain this capability just by scaling, without any changes in architecture. I had assumed that an explicit external memory would be needed, maybe something similar to a Neural Turing Machine.


None that I am aware of. The existing work all focuses on eliciting, measuring, and making good use of the capability.

The lack of an explicit external memory is not too surprising, because the text is fed back in at every iteration. That fakes having a memory: the prompt just gets bigger. That's ordinary enough. What's critical, it seems, is being able to decide on the next incremental step and execute it within the space of a single iteration, rather than simply 'guessing' the final answer.
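
Concretely, the 'memory' is nothing but the growing transcript; here's a minimal sketch of that loop (generate_step is a hypothetical stand-in for one bounded sampling call, and the stopping convention is assumed):

    def solve_with_inner_monologue(question: str, generate_step, max_steps: int = 10):
        # The growing `context` string is the only memory: each generated step
        # is appended and the entire transcript is fed back in next iteration.
        context = "Q: " + question + "\nLet's work through this step by step.\n"
        for _ in range(max_steps):
            step = generate_step(context)       # one incremental step per call
            context += step + "\n"              # the prompt just gets bigger
            if step.startswith("Answer:"):      # assumed stopping convention
                return step, context
        return None, context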

As to how that actually happens inside a large but not a small Transformer, I suspect there is a phase transition inside the Transformer itself where it changes how it fundamentally thinks, one which doesn't lead to any obvious change in the training dynamics because the two ways of thinking are initially equivalent in loss. An example of this, where the Transformer computes in a radically different way before and after a certain point in training, is Anthropic's new work on the "induction bump": https://transformer-circuits.pub/2022/in-context-learning-an...



