Totally agree with this. I have seen many cases where a dumber model gets trapped in a local minimum and burns a ton of tokens trying to escape (sometimes unsuccessfully). In a toy example (a 30-minute agentic coding session: build a markdown -> html compiler, using a subset of the CommonMark test suite to hill-climb on), dumber models would cost $18 (at retail token prices) to complete the task. Smarter models would see the trap and take only $3 to complete the task. YMMV.
Much better to look at cost per task - and good to see some benchmarks reporting this now.
For me the culprit is sub-agent usage. If I ask Claude Code to use 1-3 subagents for a task, the 5-hour limit is gone in one or two rounds, and the weekly limit shortly after. They just keep producing more and more documentation about each individual intermediate step to talk to each other, no matter how I edit the sub-agent definitions.
Care to share some of your sub-agent usage? I've always intended to really make use of them, but with skills in the mix I don't know how I'd separate the two in many use cases.
Had to modify them a bit, mostly taking out the parts I didn’t want them doing instead of me. Sometimes they produced good results, but mostly I found they did just as well as the main agent while being way more verbose. A task to do a big hunt, or to add a backend and frontend feature using two agents at once, could result in 6-8 sizable Markdown documents.
Typically I find just adding “act as a senior Python engineer with experience in asyncio” or some such to be nearly as good.
They're useful for context management. I use them frequently for research in a codebase, looking for specific behavior, patterns, etc. That type of thing eats a lot of context because a lot of data needs to be ingested and analyzed.
If you delegate that work to a sub-agent, it does all the heavy lifting, then passes the results to the main agent. The sub-agent's context is used for all the work, not the main agent's.
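To make that concrete, here's a rough sketch of the pattern, not Claude Code's actual internals; the model id and the file-loading helper are just placeholders. The point is that the bulky reading happens in a throwaway request, and only the distilled answer ever lands in the main agent's context:

```python
# Rough sketch of sub-agent delegation, not Claude Code's internals.
# Assumes the anthropic Python SDK; model id and load_files are placeholders.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()

def load_files(pattern: str) -> list[str]:
    # Toy helper: slurp every matching file so the sub-agent can read them.
    return [p.read_text() for p in Path(".").glob(pattern)]

def research_subagent(question: str, blobs: list[str]) -> str:
    # The heavy reading happens inside this one request. Its long transcript
    # is discarded; only the returned summary goes back to the main agent.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system="You are a codebase-research sub-agent. Reply with a concise summary only.",
        messages=[{"role": "user", "content": question + "\n\n---\n\n" + "\n\n".join(blobs)}],
    )
    return response.content[0].text

# Only this short summary gets appended to the main agent's context.
summary = research_subagent(
    "Where is retry/backoff implemented, and what patterns does it use?",
    load_files("src/**/*.py"),
)
```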
Hard agree. The hidden cost of 'cheap' models is the complexity of the retry logic you have to write around them.
If a cheaper model hallucinates halfway through a multi-step agent workflow, I burn more tokens on verification and error correction loops than if I just used the smart model upfront. 'Cost per successful task' is the only metric that matters in production.
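A toy illustration of that accounting (all numbers made up): the failed attempts and verification passes still get billed, so they all count toward the cost of the one success.

```python
# Toy accounting, hypothetical numbers: cost per *successful* task includes
# every failed attempt and verification/retry pass along the way.

def cost_per_successful_task(attempts: list[tuple[float, bool]]) -> float:
    """attempts: (dollars spent, succeeded?) per attempt."""
    total = sum(cost for cost, _ in attempts)
    successes = sum(1 for _, ok in attempts if ok)
    return total / successes if successes else float("inf")

# Cheap model: low sticker price per attempt, but it hallucinates mid-workflow
# and needs verification + correction loops before one run finally lands.
cheap = [(0.40, False), (0.55, False), (0.60, True)]

# Smart model: pricier per attempt, right the first time.
smart = [(0.90, True)]

print(cost_per_successful_task(cheap))  # 1.55
print(cost_per_successful_task(smart))  # 0.90
```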
The context-usage awareness is a big boost for this in my experience. I use speckit and have it set up to wrap up tasks while at least 20% of the context remains: write a summary of progress, /clear, insert the summary, and continue. This has almost entirely eliminated compacts.
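Roughly the rule I've encoded, sketched out below; this is my own illustration, not speckit's code, and the window size is an assumption.

```python
# Sketch of the "wrap up early" rule described above (not speckit's code):
# once only ~20% of the context window is left, stop, write a progress
# summary, /clear, and reseed the next session with that summary.

CONTEXT_WINDOW_TOKENS = 200_000   # assumption: Claude-class window size
WRAP_UP_THRESHOLD = 0.20          # wrap up while ~20% of the window is still free

def should_wrap_up(tokens_used: int) -> bool:
    remaining = CONTEXT_WINDOW_TOKENS - tokens_used
    return remaining / CONTEXT_WINDOW_TOKENS <= WRAP_UP_THRESHOLD

def next_session_prompt(progress_summary: str) -> str:
    # After /clear, this summary is all that carries over, so it replaces
    # whatever auto-compact would otherwise have chosen to keep.
    return f"Continue the task. Progress so far:\n{progress_summary}"

print(should_wrap_up(140_000))  # False: 30% of the window still free
print(should_wrap_up(165_000))  # True: ~17% left, summarize and /clear
```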
Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):
- Sonnet 4.5: $1.83
- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)
- Gemini 3 Pro: $1.21
Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and the tokens wasted on them) are avoided.
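A made-up back-of-the-envelope version of that: the model charging 3x per token still comes out cheaper per task if it skips the wandering.

```python
# Hypothetical prices and token counts, purely to illustrate the point.
cheap_price_per_mtok, cheap_tokens = 3.00, 4_000_000   # wanders, retries, re-reads
smart_price_per_mtok, smart_tokens = 9.00, 400_000     # goes straight there

print(cheap_price_per_mtok * cheap_tokens / 1e6)  # 12.0 -> $12 per task
print(smart_price_per_mtok * smart_tokens / 1e6)  # 3.6  -> $3.60 per task
```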