> Fine-grained recognition. We found that humans are noticeably worse at fine-grained recognition (e.g. dogs, monkeys, snakes, birds), even when they are in clear view. To understand the difficulty, consider that there are more than 120 species of dogs in the dataset. We estimate that 28 (37%) of the human errors fall into this category, while only 7 (7%) of GoogLeNet errors do.
This is an interesting observation. It also makes claims of near-human-level performance somewhat suspect.
I don't get how you can understand LLMs well enough to confidently say things like this, yet not see how many different ways the article is eye-rollingly stupid:
They're conflating ChatGPT the website with the underlying model. The former wraps the model in a system prompt that changes significantly over time, completely independently of AI alignment. Their recent custom system prompt feature confirms what everyone suspected: they've been running around like headless chickens trying to tweak that default prompt to make everyone happy, but no single default can achieve that.
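To make the distinction concrete, here's a minimal sketch, assuming the OpenAI Python SDK (>= 1.0), "gpt-4" as a placeholder model name, and an API key in the environment. When you call the model directly, you supply the system prompt yourself; ChatGPT-the-website injects its own hidden one, which can change at any time without the weights changing at all.

```python
# Minimal sketch: the "system prompt" is just the first message in the request.
# ChatGPT's website picks this for you; the API lets you pick it yourself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # placeholder name for the underlying model, addressed directly
    messages=[
        # This plays the role of ChatGPT's hidden product prompt. Swap it out
        # and "the model's behaviour" changes even though the weights are identical.
        {"role": "system", "content": "You are a terse assistant. Answer in one sentence."},
        {"role": "user", "content": "Why might ChatGPT's answers drift over time?"},
    ],
)

print(response.choices[0].message.content)
```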
The website also uses summarization to keep long chats going, which sometimes leads laypeople to claim it got worse or forgot how to do X within a single conversation, when really their original instructions left the context window long ago.
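Here's a toy illustration of that "it forgot my instructions" effect: a rolling context window that only keeps the most recent messages. The trimming policy and the word-count token estimate are made up for illustration; real systems use an actual tokenizer and more elaborate summarization.

```python
# Toy context window: keep only the newest messages that fit a token budget.
def trim_to_window(messages, max_tokens=50):
    """Keep the most recent messages whose rough token total fits the window."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = len(msg["content"].split())  # crude per-message token estimate
        if total + cost > max_tokens:
            break                           # everything older silently drops out
        kept.append(msg)
        total += cost
    return list(reversed(kept))


chat = [
    {"role": "user", "content": "Always answer in French."},  # the original instruction
    *[{"role": "user", "content": f"Long follow-up question number {i} " * 8}
      for i in range(10)],
]

window = trim_to_window(chat)
# The first instruction is no longer in the window, so the model literally
# cannot see it, even though the user remembers giving it.
print("instruction still visible:", any("French" in m["content"] for m in window))
```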
-
And then there's the fact that they're judging it on its ability to do "basic math" in the context window, when the only actual update to the underlying model centered on making function calling more reliable...
I mean, the code interpreter is now live; it makes ChatGPT brilliant at basic math and a hell of a lot more than that. Basic math isn't basic for an attention-based model.
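A hedged sketch of why tool use fixes "basic math": instead of the transformer emitting digits token by token, the model emits a structured call to a tool and the host code does the arithmetic exactly. This assumes the OpenAI Python SDK (>= 1.0) and that the model chooses to call the tool; the "calculator" tool and its schema are invented here, standing in for the real code interpreter.

```python
# Offload arithmetic to a tool via function calling, rather than asking the
# attention model to compute it in-context.
import ast
import json
import operator

from openai import OpenAI

client = OpenAI()

# A deliberately tiny, safe arithmetic evaluator standing in for the code interpreter.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> float:
    """Evaluate +, -, *, / expressions over numeric literals exactly."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

tools = [{
    "type": "function",
    "function": {
        "name": "calculator",  # hypothetical tool name for this sketch
        "description": "Evaluate a basic arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

msgs = [{"role": "user", "content": "What is 123456789 * 987654321?"}]
resp = client.chat.completions.create(model="gpt-4", messages=msgs, tools=tools)

# Assumes the model decided to call the tool; production code would check first.
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"expression": "123456789*987654321"}
print(calculate(args["expression"]))        # exact answer, no token-by-token arithmetic
```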
And it can look bad for the site or the person at first glance unless someone bothers to investigate; instead, people often just take their first impression, shrug, and move on. (Oh, wasn't that the researcher who was somehow involved in some plagiarism thing a few years back?)