https://artificialanalysis.ai/evaluations/omniscience
SWE-bench seems really hard once you are above 80%.
On the other hand, it is their own verified benchmark, which is telling.
And had all sorts of negative outcomes for the kids: https://www.edweek.org/teaching-learning/long-term-study-of-...
Exponential would be at 3.6 hours
Not massively off: Manifold yesterday implied odds this low were ~35%. It was 30% before Claude Opus 4.1 came out, which updated expected agentic coding abilities downward.