You would only expect that if you actually think this is a sentient system. It's fascinating that we already take this for granted and casually judge it like a human.
You might think that’s obvious, but it isn’t to everybody. I am currently dealing with managers at work who think that ChatGPT can already replace human judgment. Not in programming tasks, but in softer tasks like summarization. They think it can detect sarcasm, that it knows which speakers are experts and which are not, that it will avoid misconstruing opinions as fact, that it will avoid libel, that it will avoid endorsing books it hasn’t read, that it will reliably detect text that it is incapable of summarizing, and so on.
GPT-4 is on average better than humans, and within the top 10% on many tasks. Almost all the issues you mentioned are being addressed. It's good enough to be admitted to many colleges. To paraphrase Greg Brockman: "it's not perfect, but neither are you".
One interesting trick with LLMs is to sample an ensemble of solutions at a higher temperature and pick the most consistent answer. This boosts accuracy by a wide margin and also gives you a rough confidence score, so I see the models improving by external means as well.
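Rough sketch of what I mean, in Python. sample_model here is a hypothetical wrapper around whatever completion API you're using; the only real logic is the majority vote over repeated samples, and the agreement fraction is what I'm calling a confidence score.

    from collections import Counter

    def sample_model(prompt: str, temperature: float) -> str:
        # Hypothetical wrapper around whatever LLM API you use; it should
        # return only the model's final answer as a string.
        raise NotImplementedError

    def self_consistent_answer(prompt: str, n_samples: int = 10, temperature: float = 0.8):
        # Sample several answers at a higher temperature and keep the most
        # common one; the agreement fraction doubles as a confidence score.
        answers = [sample_model(prompt, temperature) for _ in range(n_samples)]
        best, count = Counter(answers).most_common(1)[0]
        return best, count / n_samples

    # e.g. answer, confidence = self_consistent_answer("What is 17 * 24? Answer with a number only.")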
>> GPT-4 is on average better than humans, and within the top 10% on many tasks.
Sorry for adding to the dogpile, but that is a big claim without anything to back it up. It's been clear for a while now that we have no objective measure of the ability of language models. See, for example, the discussion in this article:
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Especially with the shift to large language models that can realistically be created only by big corporations with vast resources and little interest in scientific questions, the whole business of language modelling has escaped the confines of science. We just have no way to know how good, or bad, those models really are. All we have are subjective impressions: the opinions of people who use them and then tweet about them.
So let's just not make such grandiose claims, right? As the authors of the paper above point out:
Performance on popular benchmarks is extremely high, but experts can easily find issues with high-scoring models. (...) Ample evidence has emerged that the systems that have topped these leaderboards can fail dramatically on simple test cases that are meant to test the very skills that the leaderboards focus on (McCoy et al., 2019; Ribeiro et al., 2020). This result makes it clear that our systems have significant room to improve.