
Rest assured that we are better at training models than naming them ;D

- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0

- Natively trained to work for many hours across multiple context windows via compaction

- 30% more token-efficient at the same reasoning level across many tasks

Let us know what you think!


I currently use GPT‑5.1-Codex High and have a workflow that works well with the 5-hour/weekly limits, credits, etc. If I use GPT‑5.1-Codex-Max Medium or GPT‑5.1-Codex-Max High, how will that compare to GPT‑5.1-Codex High in terms of cost, credits, and limits? I don't think that's clear. "Reduced tokens" makes me think it'll be priced similarly or lower, but "Max" makes me think it'll be priced higher.


Codex is an outstanding product and incremental upgrades are always welcome. I'll make sure to give it a try in the coming days. Great work! :)


Did you address this? https://github.com/openai/codex/issues/6426

How much more token-efficient is this compared to 5.0?

I had to go back to 5.0 because 5.1 was eating tokens like crazy while seeming like only a slight, barely noticeable incremental improvement.


It would be great to have access to this model via the chat interface, even if it was gated behind the "other models" dropdown or something.


Looks like a great change! I'll take it for a spin in a moment.

I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...

Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:

    research -> implementation plan -> actual implementation (based on research + plan) -> validation
I have multiple subagents, one for each phase, and (based on subjective judgement) they improve the output quality compared to keeping everything (every tool use, etc.) in the "main" context window.
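
Concretely, the loop I run looks roughly like this (an illustrative Python sketch; `run_agent` is a hypothetical stand-in for spawning a fresh subagent or separate CLI session per phase, not an actual Codex or Claude API):

    # Illustrative sketch of the spec-driven loop; `run_agent` is a
    # hypothetical stand-in for launching a fresh subagent (or a separate
    # CLI invocation) so the "main" context window never accumulates
    # every tool call from every phase.
    def run_agent(role: str, prompt: str) -> str:
        """Launch a subagent for one phase and return its final answer."""
        raise NotImplementedError  # e.g. a separate non-interactive CLI run

    def spec_driven(task: str) -> str:
        research = run_agent("researcher", f"Research the codebase for: {task}")
        plan = run_agent("planner", f"Write an implementation plan for: {task}\n\nResearch notes:\n{research}")
        change = run_agent("implementer", f"Implement this plan:\n{plan}\n\nResearch notes:\n{research}")
        report = run_agent("validator", f"Validate the change against the plan:\n{plan}\n\nChange summary:\n{change}")
        return report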

Codex CLI is great and I use it often, but I'd like more of CC's convenient context-management features. I'm super happy that compaction is now available; hopefully we'll get more features for managing context.


It would be nice if users of the codex-cli who are just using API keys as a way to handle rate limits and billing could receive these new models at the same time. I appreciate the reasoning behind the delayed 'actual API' release, but I've found the rate limiting to be quite annoying, and my own API keys don't have this limitation.


Re: rate limits, I'm not sure they can yet, given capacity. See Jensen's comment today about their cloud GPUs being sold out. Capacity increases await the ongoing data center build-out.


> 30% more token-efficient at the same reasoning level across many tasks

But they're claiming it's more token efficient, so me switching my usage to the new model should _free up_ capacity.


Will -minis come for the Codex family of models? About two months ago I used 5-mini as a daily driver for a few weeks and quite liked it; it seemed capable enough on small tasks with some hand-holding, and the speed/price were great as well.


codex-mini was released a couple of weeks ago: https://platform.openai.com/docs/models/gpt-5.1-codex-mini


Thanks! I somehow missed that. Will check it out.


Sorry, I don't like the Max model; it feels like it needs a lot more guiding. The plans it writes are better, however, so I tried feeding them back in (meta-prompt style), and that's working okay so far. Very large repository.


Did you guys fix not being able to enable web searches, or to configure no timeouts for specific commands in the SDK? (Error 124 is way too common for long-running tasks.)


So the context window is still 400k, but the model got better at removing irrelevant context?


Or it is more succinct in its thoughts.


> Natively trained

What does it even mean?


Probably that before, it was given system instructions on how to do compaction, whereas now compaction is learned by the model, making it a native ability that requires no extra instructions in the prompt.


Continuous pre-training or fine-tuning, instead of inference-time instructions. It's also possible that synthetic data for this purpose was in the pre-training as well, and they're now getting it to behave the way they'd like.


Compaction is just what Claude Code has done forever, right?


I think the point here is not that it does compaction (which Codex also already does), but that the model was trained with examples of the Codex compaction, so it should perform better after compaction has taken place (a common source of performance drops for earlier models).


Codex previously did only manual compaction, but yeah, maybe some extra training for compaction, too?


I am also trying to understand the difference between compaction, and what IDEs like Cursor do when they "summarize" context over long-running conversations.

Is this saying that said summarization now happens at the model level? Or are there other differences?


Afaik, there's no difference besides how aggressive it is.

But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.


Codex couldn't do what Claude did when reaching a full context window.


My understanding is that they trained it to explicitly use a self-prune/self-edit tool that trims or summarizes portions of its message history (e.g. old tool results from file explorations, messages that are no longer relevant, etc.) during the session, rather than "panic-compacting" at the end. In any case, it would be good if it does something like this.
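
As a toy illustration of what that in-session pruning could look like (my guess at the mechanics, not the actual Codex implementation):

    # Toy sketch of in-session compaction (not the actual Codex mechanism):
    # once the transcript nears a token budget, old bulky tool outputs are
    # replaced with short summaries instead of "panic-compacting" at the end.
    def estimate_tokens(text: str) -> int:
        return len(text) // 4  # rough heuristic: ~4 characters per token

    def compact(messages: list[dict], budget: int, keep_recent: int = 10) -> list[dict]:
        if sum(estimate_tokens(m["content"]) for m in messages) <= budget:
            return messages
        cutoff = len(messages) - keep_recent
        compacted = []
        for i, m in enumerate(messages):
            if m["role"] == "tool" and i < cutoff:
                # Keep only a one-line stub of the old, bulky tool output.
                compacted.append({"role": "tool", "content": "[summarized] " + m["content"][:80]})
            else:
                compacted.append(m)
        return compacted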


Yes. It was missing in Codex until now.


[flagged]


I would bet a lot of money it will not.


I don't see how their business would succeed. So far they are burning billions of investment dollars on compute with barely any revenue. Side hustles like Sora are a disaster: each video costs so much money, and they will never bring in any money.



It seems the LLMs are doing a lot of the heavy lifting in figuring out the exact test, build, and lint commands to run (even if the AGENTS.md file gives them direction and hints). I wonder if there are any plans to support user-defined build, test, and pre-commit commands to avoid unnecessary cost and keep it deterministic. I also wonder how monorepos (or distinct but related repos) are supported: does it run everything in one container, or loop through the environments that were edited?

I assume one easy next step is to just run GitHub Actions in the container, since everything is defined there (assuming the user has set it up).


Thanks!


The ELI5 of the paper is that most "unlearning" methods can be regarded as adding some delta `w` to the parameters of the network, but most of `w` just gets "rounded away" during quantization (i.e. `quantize(X+w) ~= quantize(X)`). Pretty clever idea as a lot of cited methods explicitly optimize/regularize to keep `w` small to avoid degrading evaluation accuracy.

To your point, it does call into question whether these methods can actually be considered true "unlearning" from an information-theoretic perspective (or whether it's the equivalent of, e.g., just putting `if (false)` around the still-latent knowledge).
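
A tiny numerical illustration of that rounding effect (my own toy example, using plain symmetric 4-bit uniform quantization, not any specific scheme from the paper):

    import numpy as np

    # Toy illustration: a small "unlearning" delta w is mostly erased by
    # coarse quantization, i.e. quantize(X + w) ~= quantize(X).
    rng = np.random.default_rng(0)
    X = rng.normal(size=100_000).astype(np.float32)          # original weights
    w = 0.01 * rng.normal(size=100_000).astype(np.float32)   # small "unlearning" update

    def quantize_codes(x: np.ndarray, scale: float, bits: int = 4) -> np.ndarray:
        qmax = 2 ** (bits - 1) - 1
        return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

    scale = float(np.abs(X).max()) / (2 ** 3 - 1)  # shared scale for a fair comparison
    unchanged = np.mean(quantize_codes(X + w, scale) == quantize_codes(X, scale))
    print(f"fraction of quantized weights unchanged by w: {unchanged:.3f}")  # close to 1.0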


It looks like they didn't want to make a public submission in order to avoid disclosing the model internals: https://cosine.sh/blog/genie-technical-report#:~:text=SWE%2D....


It's probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K across ~100 attention layers (roughly the scale of Llama 3.1 405B), that's already 1M 16-bit floats per token = 2 MB. They have likely needed to implement some kind of KV compression (like DeepSeek's) to make this even feasible.
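
Spelling out that back-of-the-envelope arithmetic (the ~10K-per-layer and ~100-layer figures are the same rough assumptions as above):

    # Back-of-the-envelope KV-cache size per token, using the rough
    # numbers above (assumed: ~10K KV dims per layer, ~100 attention
    # layers, 2 bytes per fp16 value).
    kv_dim_per_layer = 10_000
    num_layers = 100
    bytes_per_value = 2  # fp16

    values_per_token = kv_dim_per_layer * num_layers          # 1,000,000
    mb_per_token = values_per_token * bytes_per_value / 1e6   # 2.0 MB
    print(f"{mb_per_token:.1f} MB of KV cache per token")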



Thank you, that was indeed the one!


Thanks all!


This is also a good paper on the subject:

What Algorithms can Transformers Learn? A Study in Length Generalization https://arxiv.org/abs/2310.16028


Yes, this is a good empirical study on the types of tasks that have been shown to be impossible for transformers to generalise on.

With both empirical and theoretical support, I find it pretty clear that this is a real limitation.



Nice find! Better pricing than Replicate, too.


Yes. But also note that the new function calling is actually “tool calling” where the model is also fine-tuned to expect and react to the output of the function (and there are various other nuances like being able to call multiple functions in parallel and matching up the outputs to function calls precisely).

When used in multi-turn “call/response” mode it actually does start to unlock some new capabilities.
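
For example, a minimal multi-turn tool-calling loop with the openai Python SDK might look like this (the `get_weather` tool, its schema, and the model name are just illustrative placeholders):

    import json
    from openai import OpenAI

    client = OpenAI()

    # Toy tool definition; the name and schema are illustrative only.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Paris and Tokyo?"}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message

    # The model may request several tool calls in parallel; answer each one
    # by its id so outputs are matched back to the right call.
    if msg.tool_calls:
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = {"city": args["city"], "temp_c": 21}  # stub result
            messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
        final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        print(final.choices[0].message.content)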


Not the author, but really nice that they shared some real data points:

> Once our Text-to-SQL solution was in production, we were also able to observe how users interacted with the system. As our implementation improved and as users became more familiar with the feature, our first-shot acceptance rate for the generated SQL increased from 20% to above 40%. In practice, most queries that are generated require multiple iterations of human or AI generation before being finalized. In order to determine how Text-to-SQL affected data user productivity, the most reliable method would have been to experiment. Using such a method, previous research has found that AI assistance improved task completion speed by over 50%. In our real world data (which importantly does not control for differences in tasks), we find a 35% improvement in task completion speed for writing SQL queries using AI assistance.

