usaar333 · 2026-02-10T07:11:12 1770707472

True, but it gets you higher accuracy. Gemini had the best AA-Omniscience score:

https://artificialanalysis.ai/evaluations/omniscience

usaar333 · 2026-02-05T17:59:27 1770314367

I'd interpret that as rounding error; the score is effectively unchanged.

SWE-bench seems really hard once you are above 80%.

Squarex · 2026-02-05T18:04:43 1770314683

It's not a great benchmark anymore, starting with it being primarily Python/Django. The industry should move to something more representative.

usaar333 · 2026-02-05T18:17:27 1770315447

OpenAI has; they don't even mention the score for gpt-5.3-codex.

On the other hand, it is their own verified benchmark, which is telling.

usaar333 · 2025-11-22T17:48:39 1763833719

In Quebec it was a 20% jump in mother employment: https://www.bloomberg.com/news/articles/2018-12-31/affordabl...

And it had all sorts of negative outcomes for the kids: https://www.edweek.org/teaching-learning/long-term-study-of-...

usaar333 · 2025-11-18T17:39:44 1763487584

Claude 4.5 gets 82% on their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.

usaar333 · 2025-10-02T20:09:25 1759435765

That wasn't a ceasefire violation. It was a six-week ceasefire that had expired at the beginning of March.

usaar333 · 2025-09-30T19:41:50 1759261310

Physics seems better than Veo 3, at least from the demo videos.

usaar333 · 2025-09-29T19:39:59 1759174799

Except it is sublinear. Sonnet 4 was 10.2% above Sonnet 3.7 after 3 months.

GoatInGrey · 2025-09-29T20:10:15 1759176615

We should all know that in the software world, the last 10% requires 90% of the effort!

baq · 2025-09-30T06:20:51 1759213251

Sublinear as demonstrated on a sigmoid scale is quite fast enough for me thank you.

usaar333 · 2025-08-08T00:04:23 1754611463

No it doesn't. If it were even linear compared to o1 -> o3, we'd be at 2.43 hours. Instead we're only at 2.29.

Exponential would be at 3.6 hours
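
To make the linear versus exponential comparison concrete, here is a minimal sketch of both extrapolations from two earlier data points. The function names and the input values are hypothetical placeholders chosen for illustration; they are not the actual measurements behind the 2.43 and 3.6 hour figures.

```python
def linear_extrapolate(t0, h0, t1, h1, t2):
    """Extend the straight line through (t0, h0) and (t1, h1) out to time t2."""
    slope = (h1 - h0) / (t1 - t0)
    return h1 + slope * (t2 - t1)


def exponential_extrapolate(t0, h0, t1, h1, t2):
    """Extend the constant growth ratio seen between (t0, h0) and (t1, h1) out to t2."""
    growth_per_unit_time = (h1 / h0) ** (1.0 / (t1 - t0))
    return h1 * growth_per_unit_time ** (t2 - t1)


# Hypothetical placeholders: time in months since the first release,
# task horizon in hours. These are NOT the real benchmark numbers.
t0, h0 = 0.0, 1.0
t1, h1 = 4.0, 2.0
t2 = 8.0

linear_forecast = linear_extrapolate(t0, h0, t1, h1, t2)
exponential_forecast = exponential_extrapolate(t0, h0, t1, h1, t2)
print(linear_forecast, exponential_forecast)
```

With these placeholders the linear forecast is 3.0 hours and the exponential one is roughly 4.0 hours; an observed value below the linear forecast is what "sublinear" would mean here.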

usaar333 · 2025-08-07T18:33:28 1754591608

No, this is below expectations on both Manifold and LessWrong (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green...). The median was ~2.75 hours on both (which already represented a bearish slowdown).

Not massively off -- Manifold yesterday implied that the odds of a result this low were ~35%. It was 30% before Claude Opus 4.1 came out, which updated expected agentic coding abilities downward.

qsort · 2025-08-07T18:41:03 1754592063

Thanks for sharing, that was a good thread!

usaar333 · 2025-08-07T17:12:40 1754586760

At this point the prediction for SWE-bench (85% by the end of this month) is not materializing. We're actually quite far away.