To @simonw and all the coding agent and LLM benchmarkers out there: please, always publish the elapsed time it took for the task to complete successfully! I know this was just a "it works straight in claude.ai" post, but still, nowhere in the transcript is there a timestamp of any kind. Durations seem to be COMPLETELY missing from the LLM coding leaderboards everywhere [1] [2] [3].
There's a huge difference in time-to-completion from model to model and platform to platform. If, like me, you are into trial and error, rebooting the session over and over to get the prompt right or to "one-shot" it, then it matters how reasoning effort, the provider's tokens/s, coding-agent tooling efficiency, cost, and overall model intelligence play together to get the task done. The same applies to the coding agent itself, where applicable.
Grok Code Fast and Cerebras Code (Qwen) are two examples of how models can be very competitive without having top-notch intelligence. Running inference at 10x speed really allows for a leaner AI-assisted coding experience and more tasks completed per day than with a sluggish but more correct AI. Darn, I feel like a corporate butt-head right now.
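To make it concrete, here's a minimal sketch of the kind of measurement I'm asking for, in Python. run_agent_task is a hypothetical stand-in for whatever agent or API you actually drive, and the token count is whatever the provider reports:

    import time

    def timed_run(task_prompt, run_agent_task):
        # run_agent_task is a hypothetical callable standing in for the
        # coding agent / API under test; assumed to return a dict with the
        # output plus the provider-reported token counts.
        start = time.monotonic()
        result = run_agent_task(task_prompt)
        elapsed = time.monotonic() - start
        completion_tokens = result.get("completion_tokens", 0)
        print(f"elapsed: {elapsed:.1f}s")
        if completion_tokens:
            print(f"throughput: {completion_tokens / elapsed:.1f} tokens/s")
        return result, elapsed

Even just that one elapsed number per task, next to the score, would change how I read a leaderboard.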
I just wanted to say that I really liked this comment of yours; it showed professionalism and a willingness to learn from your mistakes and improve yourself.
I definitely consider you an AI influencer, especially in the Hacker News community, and I wanted to say that I see influencers who will double down, or triple down, on things when in reality people just wanted to help them in the first place.
With all of this in mind, I just wanted to say thanks. Also, your "generate me a pelican riding a bicycle" prompt has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.
For that to be useful I'd need to be running much better benchmarks - anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.
An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.
> What are the open standards for representing LLM inputs and outputs?
> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.
> LLM evals could be represented in the W3C EARL (Evaluation and Reporting Language)
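As a rough illustration (not an established schema; the resource names are made up, and it assumes rdflib with the standard PROV-O and EARL vocabularies), a single coding-agent run with its start/end times and a pass/fail outcome could look like this:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import PROV, RDF, XSD

    EARL = Namespace("http://www.w3.org/ns/earl#")
    EX = Namespace("http://example.org/")  # made-up identifiers for illustration

    g = Graph()
    g.bind("prov", PROV)
    g.bind("earl", EARL)

    run = EX["run-42"]             # one coding-agent task execution
    agent = EX["coding-agent"]     # the system under test
    task = EX["benchmark-task-1"]  # the task it attempted

    # prov:Activity captures who/what did what, and when (start/end times)
    g.add((run, RDF.type, PROV.Activity))
    g.add((run, PROV.wasAssociatedWith, agent))
    g.add((run, PROV.used, task))
    g.add((run, PROV.startedAtTime, Literal("2024-06-01T12:00:00Z", datatype=XSD.dateTime)))
    g.add((run, PROV.endedAtTime, Literal("2024-06-01T12:03:17Z", datatype=XSD.dateTime)))

    # earl:Assertion records the outcome of testing that subject against that task
    assertion = EX["assertion-42"]
    outcome = EX["result-42"]
    g.add((assertion, RDF.type, EARL.Assertion))
    g.add((assertion, EARL.subject, agent))
    g.add((assertion, EARL.test, task))
    g.add((assertion, EARL.result, outcome))
    g.add((outcome, RDF.type, EARL.TestResult))
    g.add((outcome, EARL.outcome, EARL.passed))

    print(g.serialize(format="turtle"))

Because the PROV activity carries startedAtTime/endedAtTime, elapsed time falls out of the provenance record for free.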
Yeah, I totally agree. We also need the time to completion for each step, the number of steps, prompt sizes, the number of tools, ... and better visualization of each run, broken down by the difficulty of the task.
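Roughly the kind of per-run record I have in mind (a Python sketch; the field names are just guesses at what would be worth capturing, not any existing format):

    from dataclasses import dataclass, field

    @dataclass
    class StepMetrics:
        # one record per agent step; field names are illustrative only
        step_index: int
        duration_s: float
        prompt_tokens: int
        completion_tokens: int
        tool_calls: int

    @dataclass
    class RunMetrics:
        task_id: str
        difficulty: str  # e.g. "easy" / "medium" / "hard"
        steps: list[StepMetrics] = field(default_factory=list)

        @property
        def total_duration_s(self) -> float:
            return sum(s.duration_s for s in self.steps)

With that, leaderboards could break results down by difficulty and show duration per step, not just a single score.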
This is very relevant to this release. It's way faster, but it also seems lazier and more likely to say something is done when it isn't (at least in CC). On net it feels more productive, because all the small "more padding" prompts are lightning fast, and the others you can fix.
1. https://www.swebench.com/
2. https://www.tbench.ai/leaderboard
3. https://gosuevals.com/agents.html