I asked both ChatGPT 4o and Claude 3.5 Sonnet how many r’s there are in the word strawberry and both answered “There are two r’s in the word strawberry”. When I asked “are you sure?” ChatGPT listed the letters one by one and then said yes, there are indeed two. Claude apologized for the mistake and said the correct answer is one.
If the LLM cannot even solve such a simple question, something a young child can do, and confidently gives you incorrect answers, then I’m not sure how someone could possibly trust it for complex tasks like programming.
I’ve used them both for programming and have had mixed results. The code is mediocre at BEST and downright wrong and buggy at worst. You must review and understand everything it writes. Sometimes it’s worth iterating: have it generate something, then fix it yourself or tell it what to fix. But often I’m far quicker just doing it myself.
That’s not to say that it isn’t useful. It’s great as a tool to augment learning from documentation. It’s great at making pros and cons lists. It’s great as a rubber duck. It can be helpful for setting you on a path by giving some code snippets or examples. But the code it generates should NEVER be used verbatim without review and editing; at best it’s a throwaway proof of concept.
I find them useful, but the thought that people use them as a substitute for knowing how to program, or for thinking about the problem themselves, scares me.
Sorry, but this 'benchmark question' really isn't all that useful. Asking an LLM questions that can only be answered at the letter level is like asking somebody who is red-green colorblind questions that can only be answered at the red-green level. LLMs are trained on text that has first been split into tokens comprising multiple letters; they never 'see' individual letters.
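You can see this directly with OpenAI's tiktoken library (a rough sketch; the exact split depends on which encoding the model uses):

    import tiktoken  # pip install tiktoken

    # cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo; newer models use others.
    enc = tiktoken.get_encoding("cl100k_base")
    for tok in enc.encode("strawberry"):
        # Each token is a multi-character chunk, not an individual letter.
        print(tok, enc.decode_single_token_bytes(tok))

Whatever the exact chunks turn out to be, the model only ever sees those token IDs, not the letters inside them.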
The 'confidently answering with a wrong solution' aspect is of course still a valuable insight, and yes, you need to double-check any answer you've received from an LLM. But if you've never tried GitHub Copilot, I can recommend doing so. I'd be surprised if it doesn't manage to surprise you. For me it was genuinely useful for getting those parts of the code out of the way that are essentially just an 'exercise in typing', once you've written a comment explaining the idea. (It's also very useful to have a shortcut to quickly turn off its completions, because otherwise, in situations where you know it won't come up with the right answer, you end up spending more time reading through its suggestions than actually coding.)
When asked to prove it, it spelled out the letters one by one and still failed (ChatGPT asserted the answer was still 2, Claude “corrected” itself to 1). Only when I forced it to place a count beside each letter did it get it right.
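For reference, the actual answer is three, which a one-line check confirms:

    >>> "strawberry".count("r")
    3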
It’s not really about the specific question, that just highlights that it does not have the ability to comprehend and reason. It’s a prediction machine.
If it cannot decompose such a simple problem, then how can it possibly get complex programming problems right, ones that cannot simply be pattern matched to a solution? My experience with ChatGPT, Claude, and Copilot writing code demonstrates this. It often generates code that looks correct on the surface, but when tested either fails outright or fails subtly.
It even gets things like CSS wrong, producing output that on the surface seems to do what you asked but in fact doesn’t style things correctly at all.
Its lack of ability to understand, decompose, and reason is the problem. The fact that it’s so confident even when wrong is the problem. The fact that it cannot detect when it doesn’t know is the problem.
It generates text that has a high probability of “looking” correct, not text that has a high probability of being correct. With simple questions like the one I posed, it’s obvious to us when it gets it wrong. With complex programming tasks, the solution is complex enough that it often takes significant effort to determine whether it’s right or wrong. There’s more room for it to “look” correct without “being” correct.
> But if you've never tried GitHub Copilot
I used it for almost a year before I cancelled my subscription because it wasn’t adding much value. I found Copilot Chat a bit more useful, but ChatGPT was good enough for that. I still use ChatGPT when programming: as a tool to help with documentation (“what’s the React function to do X”-type questions), to rubber duck, to ask for pros and cons lists on ideas or approaches, and to get starting points. But never to write the code for me, at least not without the expectation of significant rewriting, unless it’s super trivial (but then I likely would have written it faster myself anyway).
Thanks for taking the time to answer so thoroughly :)
In that case I stand corrected; I'd just assumed you hadn't used Copilot because, to me, it was so much more effective at aiding with programming than ChatGPT. But I suspect that very much depends on the use case. I liked it a lot for e.g. writing numpy code, where I'd otherwise have had to look up the documentation on every function, or for writing database migrations by hand, where the patterns are very clear, and in those situations it felt like a huge time-saver. But for other applications it didn't help at all, or admittedly even introduced subtle bugs that were fun to find and fix.
After my free year of Copilot ran out I also didn't re-subscribe, because I have too many AI-related subscriptions as it stands, but I'd definitely (carefully) use it if I had access to it via an org or an open-source project.
To be completely fair, there are some things I have had success getting code generated for. For example, I made a little Python script to pull fields out of TOML files and convert them to CSV (so that I could import the data into a spreadsheet). It did mostly OK on this (in that I didn’t have to edit the final code that much and it was in fact faster than writing it all myself).
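Something in roughly this shape (a sketch with made-up field names and paths, not the actual code; tomllib ships with Python 3.11+):

    import csv
    import glob
    import tomllib  # standard library in 3.11+; older versions can use the tomli package

    rows = []
    for path in glob.glob("data/*.toml"):  # hypothetical input directory
        with open(path, "rb") as f:        # tomllib expects a binary file object
            doc = tomllib.load(f)
        # hypothetical fields pulled from each TOML file
        rows.append({
            "name": doc.get("name"),
            "date": doc.get("date"),
            "amount": doc.get("amount"),
        })

    with open("out.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "date", "amount"])
        writer.writeheader()
        writer.writerows(rows)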
But the cases where I found its code good enough were 1) fairly easy tasks (i.e. I didn’t need AI to do them, but it still saved some time), and 2) not that common for the type of development I’ve mostly been doing. The problem is that I’ve often wasted significant time figuring out whether or not a given task is one of these, so in the long run it just doesn’t feel that useful to me as a “write code for me” tool. But as I said, I do find AI a useful aid, just not for writing my code for me.