Models are still leapfrogging each other every month in coding or research capability, and even in more mundane tasks such as summarizing long, multi-topic texts.
Depending on which side of the issue you fall, you're either hoping this will go on for a long time to come, or praying that it will end asap.
I'm not using the cheapest model in either my own work or my production systems.
From what I've seen, the "leapfrogging" is very, very incremental.
They all seem to be racing to the plateau. It doesn't look like there will ever be a "stand out" leader, and the product each company presents to the market appears to be essentially the same product everyone else presents. Maybe with some slight twist to it that is easily replicated or surpassed within a few months.
This is the issue, really. At some point the investors are all going to realize that none of their investments are going to be market leaders. When they get to that stage, the bubble will well and truly pop.
To me it feels like there is no plateau, and the models are already very useful and impactful.
I believe there is no plateau because there is nothing objectively special or magical about the human mind and it all can and will be eventually solved, one hack at a time.
Some part of an LLM's capabilities seems to get lost when models are tuned to game benchmarks.
Claude 3.7 is a great example of a model that clearly beats 3.5 in all benchmarks, yet slowly destroys my code base by adding lots of extra lines or hacking around my instructions (adding "if" statements when I want it to change the code to handle a case, instead of understanding what change really needs to be made).
I still prefer o1 pro, and a lot of that leapfrogging in benchmarks doesn't translate into the models being smarter anymore.
Adoption is critical for these LLM corporations, because unlike in other industries, free tier users here incur almost the same costs as paid tier users. They really can't degrade the free tier experience too much, or their customers will flee to the competitors. I've read one analysis of these companies' expenses, and they are truly insane by now and constantly rising.
Working close to the edge of AI usage, it's important to realize that most AI use cases are not "fully autonomous AI software engineer" or "deep research into a niche topic" but far more innocuous: improve my blog post, what's the capital of France, what are some nice tourist sites to see around my next vacation destination.
For those non-edge use cases, costs are an issue, but so are inertia and switching costs. A big reason OpenAI and ChatGPT are so huge is that ChatGPT is still the go-to model for all of these non-edge use cases: it's well known, well adopted, and quite frankly very efficiently priced.
You don't have to create a real software engineer; you just have to create one that looks close enough to get some executive his bonus, and that won't fall over before he's moved on to another company.
Yes, there are differences between the models, and yes, some may work better.
But picking a model at this point is just picking the cheapest option. For most use cases, any model will do.