
To @simonw and all the coding agent and LLM benchmarkers out there: please, always publish the elapsed time it took for the task to complete successfully! I know this was just an "it works straight in claude.ai" post, but still, nowhere in the transcript is there a timestamp of any kind. Durations seem to be COMPLETELY missing from the LLM coding leaderboards everywhere [1] [2] [3]

There's a huge difference in time-to-completion from model to model and platform to platform, and if, like me, you are into trial and error, rebooting the session over and over to get the prompt right or to "one-shot" it, it matters how reasoning effort, the provider's tokens/s, coding-agent tooling efficiency, cost, and overall model intelligence play together to get the task done. The same applies to the coding agent, when applicable.
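As a rough illustration of the kind of timing I mean, here is a minimal Python sketch that times a single prompt against an OpenAI-compatible chat completions endpoint and derives tokens/s; the endpoint, model name, and prompt are placeholders, not anything from the original post:

    # Minimal sketch: time one task and report elapsed seconds and tokens/s.
    # API_BASE, MODEL, and the prompt below are assumptions for illustration.
    import os
    import time
    import requests

    API_BASE = os.environ.get("API_BASE", "https://api.openai.com/v1")  # assumed endpoint
    MODEL = os.environ.get("MODEL", "gpt-4o-mini")                      # assumed model name

    def timed_completion(prompt: str) -> dict:
        start = time.monotonic()
        resp = requests.post(
            f"{API_BASE}/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
            timeout=600,
        )
        resp.raise_for_status()
        elapsed = time.monotonic() - start
        usage = resp.json().get("usage", {})
        completion_tokens = usage.get("completion_tokens", 0)
        return {
            "elapsed_s": round(elapsed, 2),
            "completion_tokens": completion_tokens,
            # tokens/s only reflects generation throughput if queueing and
            # reasoning time are small; report both numbers, not just the ratio.
            "tokens_per_s": round(completion_tokens / elapsed, 1) if elapsed else None,
        }

    print(timed_completion("Write a function that parses ISO-8601 durations."))

Reporting both the wall-clock duration and the throughput avoids hiding slow "thinking" time behind a flattering tokens/s figure.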

Grok Code Fast and Cerebras Code (Qwen) are two examples of how models can be very competitive without being top-notch in intelligence. Running inference at 10x speed makes for a leaner AI-assisted coding experience and more tasks completed per day than a sluggish but more correct model. Darn, I feel like a corporate butt-head right now.

1. https://www.swebench.com/

2. https://www.tbench.ai/leaderboard

3. https://gosuevals.com/agents.html



That's a good call, I'll try to remember that for next time.


I just wanted to say that I really liked this comment of yours; it showed professionalism and learning from your mistakes and improving yourself.

I definitely consider you an AI influencer, especially in the Hacker News community, and I've seen influencers who will double down, triple down on things when in reality people just wanted to help them in the first place.

I just wanted to say thanks with all of this in mind. Also, your "generate me a pelican riding a bicycle" benchmark has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.


Have you thought about benchmarking models a month or two after release to see how they compare with the day-1 release?


For that to be useful I'd need to be running much better benchmarks - anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.

An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/


Manually,

From https://news.ycombinator.com/item?id=40859434 :

> E.g promptfoo and chainforge have multi-LLM workflows.

> Promptfoo has a YAML configuration for prompts, providers, and more: https://www.promptfoo.dev/docs/configuration/guide/

openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md

From https://news.ycombinator.com/item?id=45267271 :

> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.

> What are the open standards for representing LLM inputs, and outputs?

> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.

> LLM evals could be represented in W3C EARL Evaluation and Reporting Language
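As a sketch of what that could look like, here is a small Python example using rdflib to model a single eval run as a prov:Activity associated with a model (prov:Agent) that generates a transcript (prov:Entity); the URIs and timestamps are made up for illustration:

    # Hedged sketch: modeling one LLM eval run with W3C PROV terms via rdflib.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, XSD

    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    g.bind("prov", PROV)

    run = URIRef("urn:example:eval-run-1")           # the eval run (prov:Activity), placeholder URI
    model = URIRef("urn:example:model-under-test")   # the model (prov:Agent), placeholder URI
    transcript = URIRef("urn:example:transcript-1")  # the output (prov:Entity), placeholder URI

    g.add((run, RDF.type, PROV.Activity))
    g.add((model, RDF.type, PROV.Agent))
    g.add((transcript, RDF.type, PROV.Entity))
    g.add((run, PROV.wasAssociatedWith, model))
    g.add((transcript, PROV.wasGeneratedBy, run))
    # Start/end times are illustrative; they also capture the elapsed time
    # the thread is asking benchmarks to publish.
    g.add((run, PROV.startedAtTime, Literal("2025-01-01T12:00:00Z", datatype=XSD.dateTime)))
    g.add((run, PROV.endedAtTime, Literal("2025-01-01T12:03:42Z", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))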

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://news.ycombinator.com/item?id=42927611

"California governor signs AI transparency bill into law" (2025) https://news.ycombinator.com/item?id=45418428 :

> https://sb53.info/

Is this the first of its sort?:

> CalCompute


Yeah, I totally agree: we need the time to completion of each step, the number of steps, prompt sizes, number of tools, and so on, plus better visualization of each run and a breakdown based on the difficulty of the task.
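For what it's worth, a minimal sketch of that per-step bookkeeping might look like this in Python; the field names and difficulty labels are illustrative, not from any particular harness:

    # Hedged sketch: record per-step timing, prompt size, and tool-call counts
    # for one agent run, then summarize. Names are assumptions for illustration.
    import time
    from dataclasses import dataclass, field

    @dataclass
    class StepRecord:
        name: str
        prompt_chars: int
        tool_calls: int
        elapsed_s: float

    @dataclass
    class RunLog:
        difficulty: str                          # e.g. "easy" / "medium" / "hard"
        steps: list[StepRecord] = field(default_factory=list)

        def record(self, name: str, prompt: str, tool_calls: int, start: float) -> None:
            self.steps.append(StepRecord(name, len(prompt), tool_calls, time.monotonic() - start))

        def summary(self) -> dict:
            return {
                "difficulty": self.difficulty,
                "steps": len(self.steps),
                "total_s": round(sum(s.elapsed_s for s in self.steps), 2),
                "total_tool_calls": sum(s.tool_calls for s in self.steps),
            }

    log = RunLog(difficulty="medium")
    t0 = time.monotonic()
    log.record("plan", prompt="Outline the fix for the failing test", tool_calls=0, start=t0)
    print(log.summary())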


This is very relevant to this release. It’s way faster, but also seems lazier and more likely to say something’s done when it isn’t (at least in CC). On net it feels more productive because all the small “more padding” prompts are lightning fast, and the others you can fix.



