
To @simonw and all the coding agent and LLM benchmarkers out there: please, always publish the elapsed time it took for the task to complete successfully! I know this was just an "it works straight in claude.ai" post, but still, nowhere in the transcript is there a timestamp of any kind. Durations seem to be COMPLETELY missing from the LLM coding leaderboards everywhere [1] [2] [3]

There's a huge difference in time-to-completion from model to model and platform to platform, and if, like me, you are into trial and error, rebooting the session over and over to get the prompt right or to "one-shot" it, it matters how reasoning effort, the provider's tokens/s, coding-agent tooling efficiency, cost, and overall model intelligence play together to get the task done. The same applies to the coding agent, when applicable.
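As a rough illustration of the kind of timing I mean, here is a minimal Python sketch that times a single prompt against an OpenAI-compatible chat completions endpoint and derives tokens/s; the endpoint, model name, and prompt are placeholders, not anything from the original post:

    # Minimal sketch: time one task and report elapsed seconds and tokens/s.
    # API_BASE, MODEL, and the prompt below are assumptions for illustration.
    import os
    import time
    import requests

    API_BASE = os.environ.get("API_BASE", "https://api.openai.com/v1")  # assumed endpoint
    MODEL = os.environ.get("MODEL", "gpt-4o-mini")                      # assumed model name

    def timed_completion(prompt: str) -> dict:
        start = time.monotonic()
        resp = requests.post(
            f"{API_BASE}/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
            timeout=600,
        )
        resp.raise_for_status()
        elapsed = time.monotonic() - start
        usage = resp.json().get("usage", {})
        completion_tokens = usage.get("completion_tokens", 0)
        return {
            "elapsed_s": round(elapsed, 2),
            "completion_tokens": completion_tokens,
            # tokens/s only reflects generation throughput if queueing and
            # reasoning time are small; report both numbers, not just the ratio.
            "tokens_per_s": round(completion_tokens / elapsed, 1) if elapsed else None,
        }

    print(timed_completion("Write a function that parses ISO-8601 durations."))

Reporting both the wall-clock duration and the throughput avoids hiding slow "thinking" time behind a flattering tokens/s figure.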

Grok Code Fast and Cerebras Code (Qwen) are two examples of how models can be very competitive without being top-notch in intelligence. Running inference at 10x speed makes for a leaner AI-assisted coding experience and more tasks completed per day than a sluggish but more correct model. Darn, I feel like a corporate butt-head right now.

1. https://www.swebench.com/

2. https://www.tbench.ai/leaderboard

3. https://gosuevals.com/agents.html



That's a good call, I'll try to remember that for next time.


I just wanted to say that I really liked this comment of yours; it showed professionalism and learning from your mistakes and improving yourself.

I definitely consider you an AI influencer, especially in the Hacker News community, and I've seen influencers who will double down, triple down on things when in reality people just wanted to help them in the first place.

I just wanted to say thanks with all of this in mind. Also, your "generate me a pelican riding a bicycle" benchmark has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.


Have you thought about benchmarking models a month or two after release to see how they compare with the day-1 release?


For that to be useful I'd need to be running much better benchmarks - anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.

An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/


Manually,

From https://news.ycombinator.com/item?id=40859434 :

> E.g promptfoo and chainforge have multi-LLM workflows.

> Promptfoo has a YAML configuration for prompts, providers, and more: https://www.promptfoo.dev/docs/configuration/guide/

openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md

From https://news.ycombinator.com/item?id=45267271 :

> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.

> What are the open standards for representing LLM inputs, and outputs?

> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.

> LLM evals could be represented in W3C EARL Evaluation and Reporting Language
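As a sketch of what that could look like, here is a small Python example using rdflib to model a single eval run as a prov:Activity associated with a model (prov:Agent) that generates a transcript (prov:Entity); the URIs and timestamps are made up for illustration:

    # Hedged sketch: modeling one LLM eval run with W3C PROV terms via rdflib.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, XSD

    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    g.bind("prov", PROV)

    run = URIRef("urn:example:eval-run-1")           # the eval run (prov:Activity), placeholder URI
    model = URIRef("urn:example:model-under-test")   # the model (prov:Agent), placeholder URI
    transcript = URIRef("urn:example:transcript-1")  # the output (prov:Entity), placeholder URI

    g.add((run, RDF.type, PROV.Activity))
    g.add((model, RDF.type, PROV.Agent))
    g.add((transcript, RDF.type, PROV.Entity))
    g.add((run, PROV.wasAssociatedWith, model))
    g.add((transcript, PROV.wasGeneratedBy, run))
    # Start/end times are illustrative; they also capture the elapsed time
    # the thread is asking benchmarks to publish.
    g.add((run, PROV.startedAtTime, Literal("2025-01-01T12:00:00Z", datatype=XSD.dateTime)))
    g.add((run, PROV.endedAtTime, Literal("2025-01-01T12:03:42Z", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))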

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://news.ycombinator.com/item?id=42927611

"California governor signs AI transparency bill into law" (2025) https://news.ycombinator.com/item?id=45418428 :

> https://sb53.info/

Is this the first of its sort?:

> CalCompute


Yeah, I totally agree: we need the time to completion of each step, the number of steps, prompt sizes, number of tools, and so on, plus better visualization of each run and a breakdown based on the difficulty of the task.
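For what it's worth, a minimal sketch of that per-step bookkeeping might look like this in Python; the field names and difficulty labels are illustrative, not from any particular harness:

    # Hedged sketch: record per-step timing, prompt size, and tool-call counts
    # for one agent run, then summarize. Names are assumptions for illustration.
    import time
    from dataclasses import dataclass, field

    @dataclass
    class StepRecord:
        name: str
        prompt_chars: int
        tool_calls: int
        elapsed_s: float

    @dataclass
    class RunLog:
        difficulty: str                          # e.g. "easy" / "medium" / "hard"
        steps: list[StepRecord] = field(default_factory=list)

        def record(self, name: str, prompt: str, tool_calls: int, start: float) -> None:
            self.steps.append(StepRecord(name, len(prompt), tool_calls, time.monotonic() - start))

        def summary(self) -> dict:
            return {
                "difficulty": self.difficulty,
                "steps": len(self.steps),
                "total_s": round(sum(s.elapsed_s for s in self.steps), 2),
                "total_tool_calls": sum(s.tool_calls for s in self.steps),
            }

    log = RunLog(difficulty="medium")
    t0 = time.monotonic()
    log.record("plan", prompt="Outline the fix for the failing test", tool_calls=0, start=t0)
    print(log.summary())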


This is very relevant to this release. It’s way faster, but also seems lazier and more likely to say something’s done when it isn’t (at least in CC). On net it feels more productive because all the small “more padding” prompts are lightning fast, and the others you can fix.



