Hacker News | usaar333's comments

True, but it gets you higher accuracy. Gemini had the best aa-omniscience score

https://artificialanalysis.ai/evaluations/omniscience


i'd interpret that as rounding error. that is unchanged

swe-bench seems really hard once you are above 80%


it's not a great benchmark anymore... starting with it being python / django primarily... the industry should move to something more representative

OpenAI has; they don't even mention a score for gpt-5.3-codex.

On the other hand, it is their own verified benchmark, which is telling.


In Quebec it was a 20% jump in mother employment: https://www.bloomberg.com/news/articles/2018-12-31/affordabl...

And had all sorts of negative outcomes for the kids: https://www.edweek.org/teaching-learning/long-term-study-of-...


Claude 4.5 gets 82% on their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.


That wasn't a ceasefire violation. It was a six week ceasefire that had expired at the beginning of March


Physics seems better than veo 3 at least from demo videos


Except it is sublinear. Sonnet 4 was 10.2% above sonnet 3.7 after 3 months.


We should all know that in the software world, the last 10% requires 90% of the effort!


Sublinear as demonstrated on a sigmoid scale is quite fast enough for me thank you.


No it doesn't. If it were even linear compared to o1 -> o3, we'd be at 2.43 hours. Instead we're only at 2.29.

Exponential would be at 3.6 hours
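
For anyone wanting to redo this kind of comparison, here is a rough sketch of one way to read "linear" vs "exponential": take two consecutive measurements and either repeat the absolute gain or repeat the growth ratio. It assumes the gaps between releases are treated as equal, which is a simplification. The function names and the input values are placeholders for illustration, not the actual time-horizon figures behind the numbers above.

```python
def linear_next(prev: float, curr: float) -> float:
    """Next value if the absolute gain (curr - prev) simply repeats."""
    return curr + (curr - prev)


def exponential_next(prev: float, curr: float) -> float:
    """Next value if the growth ratio (curr / prev) repeats."""
    if prev <= 0:
        raise ValueError("prev must be positive to compute a growth ratio")
    return curr * (curr / prev)


# Placeholder inputs in hours (hypothetical, NOT real benchmark results).
prev_hours, curr_hours = 1.0, 2.0

print(linear_next(prev_hours, curr_hours))       # 3.0
print(exponential_next(prev_hours, curr_hours))  # 4.0
```

Plug in the two earlier measurements and compare the observed value against both outputs: below the linear figure means sublinear under this reading.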


No, this is below expectations on both Manifold and lesswrong (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green...). Median was ~2.75 hours on both (which already represented a bearish slowdown).

Not massively off -- manifold yesterday implied odds this low were ~35%. 30% before Claude Opus 4.1 came out which updated expected agentic coding abilities downward.


Thanks for sharing, that was a good thread!


At this point the prediction for SWE bench (85% by end of this month) is not materializing. We're actually quite far away.

