It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
I think Anthropic is reading the room and is just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full-blown multimodality at the highest level.
It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on.
Codex has been good enough to me and it’s much cheaper.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it's good enough that I don't even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
Not to start a war, but I've had 'fast' Claude write reams of slop code that I then had to work with Codex to remove. Add this to the pile of "yeah, but I saw the opposite with <insert model>" - but that's been my 2 cents.
Putting the latest Gemini CLI through some tough coding tasks (C++) for my project, I found it to be slower than even Codex, but the quality was good.
The problem I have is skepticism. Gemini 2.5 Pro was amazing on release; I couldn't stop talking about it. And then it became worthless in my workflows after a few months. I suspect Google (and other vendors) do this bait-and-switch with every release.
Claude can definitely write a lot of not-great code, but IME that's easy enough to mitigate by having it write a planning document first, then implement it step by step based on a to-do list in that planning document. Cursor's plan mode works great for this. It lets you review the outline at the start, then review each bit as the model writes it.
That said, I haven't had a good experience with Claude Code for the reason you described. Maybe it's Cursor (or similar IDE) that makes the difference.
My issue with Codex is needing to run it in WSL on Windows, because it spams confirmation requests for running even the safest of commands (e.g. list directory contents, read file, git status). That in turn adds an extra layer of complexity when hooking it up via MCP to anything running in Windows outside of WSL (like, say, Figma).
In Claude, on the other hand, MCP connections really do seem to 'just work'.
It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p).
From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages, but it is expensive (at least for some, if not for all).
My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
Yeah, you can see this even by just running claude-code against other models. For example, DeepSeek used as a backend for CC tends to produce results mostly competitive with Sonnet 4.5. A lot of it is just in the tooling and prompting.
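For anyone who wants to try that setup: the usual way (as I understand it, and assuming DeepSeek's Anthropic-compatible endpoint; double-check their docs for the exact URL and model name) is to point Claude Code at the other provider via environment variables before launching it:

    # Rough sketch, not verified against current docs:
    export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"  # assumed Anthropic-compatible endpoint
    export ANTHROPIC_AUTH_TOKEN="<your DeepSeek API key>"
    export ANTHROPIC_MODEL="deepseek-chat"                          # assumed model name
    claude

Everything else (the agentic loop, prompts, tool use) stays Claude Code's, which is the point: you're swapping only the model behind it.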
IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.
This seems preferable to wasting tokens on tools when a standardized, reliable interface to those tools should be all that's required.
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to be on a standard eval harness.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
Yes, two things:
1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1.
2. More importantly, GPT 5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for GPT 5.1 Codex. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
> But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.
That'd be a bad idea. Models are often trained for specific tools (GPT-5.1 Codex is trained for Codex CLI, and Sonnet has been trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, as they all work differently.
Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will just get gamed instead.
50% of the CLs in SWE-Bench Verified are from the Django codebase. So if you're a big contributor to Django, you should care a lot about that benchmark. Otherwise the difference between models is ±2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better.
Their scores on SWE-bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on TerminalBench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of Python).
I find Gemini 2.5 pro to be as good or in some cases better for SQL than GPT 5.1. It's aging otherwise, but they must have some good SQL datasets in there for training.
You're right on SWE Bench Verified, I missed that and I'll delete my comment.
GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is the Codex-specific harness vs the model). Looking forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.