I said that the context window improved; I meant that it is larger. GPT-3.5 has a 4k-token window, while GPT-4 has 8k tokens (standard) or 32k tokens (API access only atm). This is the number of tokens that GPT-X can take into account when producing a response.
Specifically, I was using this to support the statement "In the very near future your IDE will send the whole codebase as context to LLMs." I'm not talking about loops or accuracy.
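To make the "whole codebase as context" point concrete: here is a rough sketch of how an IDE could check whether a project fits inside a given context window. The ~4 characters-per-token ratio is a crude assumption (a real tool would use an actual tokenizer), and the function names are made up for illustration.

```python
import os

# Assumption: roughly 4 characters per token for English text and code.
# This is a ballpark heuristic, not an exact tokenizer.
CHARS_PER_TOKEN = 4

def estimate_tokens(root, exts=(".py", ".java")):
    """Estimate the total token count of source files under `root`."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total_chars += len(f.read())
    return total_chars // CHARS_PER_TOKEN

def fits_in_window(root, window=32_000):
    """Would the whole codebase fit in a context window of `window` tokens?"""
    return estimate_tokens(root) <= window
```

Even with 32k tokens, only small projects fit whole; anything larger needs chunking or retrieval on top.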
It's true, but there is no indication that GPT can explain larger concepts for you, and if anything, indications that it won't be able to do so accurately.
It can't even explain small pieces of code to me unless it is something it has been trained on. Often it gets even simple things wrong, either obviously, or worse, subtly wrong.
I agree that this is the part that needs more work, and is most uncertain. Increasing context windows seems like a fairly straightforward computational challenge (albeit potentially expensive). On the other hand, whether we can scale current models towards "true understanding" (or similar) is a total unknown atm.
I still think we will get useful things from scaling up current models though. I've already got a lot of value out of Copilot, for instance, and I'm looking forward to the next version based on GPT-4. Recently, I've been using the GPT-3 Copilot to write a lot of pandas/matplotlib code, which is fairly straightforward and repetitive, but as mainly a Java developer, I just don't have the APIs at my fingertips. Copilot helps a lot with this sort of thing.
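For a sense of the kind of straightforward, repetitive pandas/matplotlib code meant here, this is the sort of boilerplate Copilot autocompletes well. The data and column names are entirely made up for illustration.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly figures; columns invented for the example.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120, 135, 150, 160],
    "costs": [100, 105, 110, 118],
})

# Typical boilerplate: derive a column, then plot with axis labels.
df["profit"] = df["revenue"] - df["costs"]

fig, ax = plt.subplots()
ax.bar(df["month"], df["profit"])
ax.set_xlabel("Month")
ax.set_ylabel("Profit")
ax.set_title("Monthly profit")
fig.savefig("profit.png")
```

None of this is hard, but if you don't use these APIs daily, having the calls filled in for you saves a lot of documentation lookups.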