I'd also be highly wary of the method they used because of statements like this:
>we note that the vast majority of its answers simply stated the final answer without additional justification
While the reasoning steps are obviously important for judging human participants' answers, none of the current big-name providers disclose their models' actual reasoning tokens. So unless they got direct internal access to these models from the big companies (which seems highly unlikely), this might be yet another flawed study, of which we have seen several in recent months, even from serious parties.
Wow, that’s incredible. Cats are progressing so fast; especially the unreleased cats seem to be doing much better. My two orange kitties aren’t doing well on math problems, but obviously that’s because I’m not prompting the right way – any day now. If I ever get it to work, I’ll be sure to share the achievements on X, while carefully avoiding explaining how I did it or providing any data that could corroborate the claims.
That's a far less plausible claim. OpenAI could have thrown more resources at the problem, and I would be surprised if that didn't improve the results.
The OP claims the publicly available models all failed to get Bronze.
The OpenAI tweet claims there is an unreleased model that can get Gold.