> than you would expect from an algorithm that just predicts the next word.
I think there is a common mistake in this notion of "just predicting the next word." While it is true that only the next word is predicted, a good way to do that is to internally imagine more than the next word and then emit just that next word. For the word after that, the process repeats with a fresh imagination.
One may say that this is not what it does, and I would reply: show me that this is not exactly what the learned state does. Even if the following words are never explicitly constructed anywhere, they can be implicit in the computation.
To say this differently: what we think of as just the next word is actually a continuation that manifests as a single word. This would remain true even if, in fact, the task is only to predict the next word. Which is to say that the next word is more than it sounds.
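As a toy illustration of this idea (not how a real language model works internally), one can write a generator that explicitly scores whole multi-word continuations but only ever emits the first word of the best one. The vocabulary, scoring function, and lookahead depth below are all made up for the sketch:

```python
import itertools

# Hypothetical toy setup: a tiny vocabulary and a scoring function that
# rewards agreement with one target sentence. A real model's "imagination"
# would be implicit in its learned state, not an explicit search like this.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def score(sequence):
    # Toy score: count positions matching "the cat sat on the mat".
    target = ["the", "cat", "sat", "on", "the", "mat"]
    return sum(1 for a, b in zip(sequence, target) if a == b)

def next_word(context, lookahead=3):
    # "Imagine" every continuation of `lookahead` words,
    # then emit only the first word of the highest-scoring one.
    best = max(itertools.product(VOCAB, repeat=lookahead),
               key=lambda cont: score(context + list(cont)))
    return best[0]

# Word-by-word generation: each step plans ahead, then outputs one word.
context = []
for _ in range(4):
    context.append(next_word(context))
print(context)  # ['the', 'cat', 'sat', 'on']
```

Externally this is "just" next-word prediction, one word at a time; internally each step is a search over full continuations, which is the point of the argument above.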