It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
I think Anthropic is reading the room and is just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full-blown multimodality at the highest level.
It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on.
Codex has been good enough to me and it’s much cheaper.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it's good enough that I don't even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
Not to start a war, but I've had 'fast' Claude write reams of slop code that I then had to work with Codex to remove. Add this to the pile of "yeah, but I saw the opposite with <insert model>" - but that's been my 2 cents.
Putting the latest Gemini CLI through some tough coding tasks (C++) for my project, I found it to be slower than even Codex, but the quality was good.
The problem I have is skepticism. Gemini 2.5 Pro was amazing on release; I couldn't stop talking about it. And then it became worthless in my workflows after a few months. I suspect Google (and other vendors) do this bait-and-switch with every release.
Claude can definitely write a lot of not-great code, but IME that's easy enough to mitigate by having it write a planning document first, then implement it step by step based on a to-do list in that planning document. Cursor's plan mode works great for this. It lets you review the outline at the start, then review each bit as the model writes it.
That said, I haven't had a good experience with Claude Code for the reason you described. Maybe it's Cursor (or similar IDE) that makes the difference.
My issue with Codex is needing to run it in WSL on Windows, because it spams confirmation requests for running even the safest of commands (e.g. list directory contents, read file, git status). That in turn adds an extra layer of complexity when hooking it up via MCP to anything running in Windows outside of WSL (like, say, Figma).
In Claude, on the other hand, MCP connections really do seem to 'just work'.
It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p).
From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages, but it is expensive (at least for some, if not for all).
My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
Yeah, you can see this even by just running claude-code against other models. For example, DeepSeek used as a backend for CC tends to produce results mostly competitive with Sonnet 4.5. A lot of it is just in the tooling and prompting.
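For anyone who wants to try that setup: the usual way (as I understand it, and assuming DeepSeek's Anthropic-compatible endpoint; double-check their docs for the exact URL and model name) is to point Claude Code at the other provider via environment variables before launching it:

    # Rough sketch, not verified against current docs:
    export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"  # assumed Anthropic-compatible endpoint
    export ANTHROPIC_AUTH_TOKEN="<your DeepSeek API key>"
    export ANTHROPIC_MODEL="deepseek-chat"                          # assumed model name
    claude

Everything else (the agentic loop, prompts, tool use) stays Claude Code's, which is the point: you're swapping only the model behind it.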
IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.
This seems preferable to wasting tokens on tools when a standardized, reliable interface to those tools should be all that's required.
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to be on a standard eval harness.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
Yes, two things:
1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1.
2. More importantly, GPT 5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for GPT 5.1 Codex. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
> But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.
That'd be a bad idea. Models are often trained for specific tools (GPT-5.1 Codex is trained for Codex CLI, and Sonnet has been trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, as they all work differently.
Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will just get gamed instead.
50% of the CLs in SWE-Bench Verified are from the Django codebase. So if you're a big contributor to Django, you should care a lot about that benchmark. Otherwise the difference between models is ±2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better.
Their scores on SWE-bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on TerminalBench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of Python).
I find Gemini 2.5 pro to be as good or in some cases better for SQL than GPT 5.1. It's aging otherwise, but they must have some good SQL datasets in there for training.
You're right on SWE Bench Verified, I missed that and I'll delete my comment.
GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is the Codex-specific harness vs the model). Looking forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.