
I am super bullish on claude code / codex cli + LSP and other deterministic codemod and code intelligence tools.

I was playing around with codex this weekend and honestly having a great time (my opinion of it has done a 180 since gpt-5.2(-codex) came out), but I kept getting annoyed because it kept missing references when I asked it to rename or move symbols. So I built a skill that teaches it to use rope for mechanical Python codebase refactors: https://github.com/brian-yu/python-rope-refactor

Been pretty happy with it so far!
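
For reference, this is roughly what a project-wide rename looks like when driven through rope's refactoring API directly (a hedged sketch with placeholder paths and symbol names, not code from the linked skill):

    from rope.base.project import Project
    from rope.refactor.rename import Rename

    project = Project(".")                              # project root
    resource = project.get_resource("pkg/models.py")    # file containing the symbol
    source = resource.read()
    offset = source.index("class Foo") + len("class ")  # character offset of `Foo`

    changes = Rename(project, resource, offset).get_changes("Baz")
    print(changes.get_description())                    # preview the project-wide diff
    project.do(changes)                                 # apply to every reference
    project.close()

Because rope resolves references rather than matching text, this is exactly the kind of mechanical step that's cheaper to hand off to a tool than to have the model re-derive each time.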


OpenAI engineer fails to rename references because his F2 key has been replaced with the Copilot button?

No LSP support is wild.


This is something I notice often when using these tools (if this is what you are referring to). They will grep entire codebases to search for a word rather than search by symbol. I suppose they don't care to fix these types of things since it all adds up to paid tokens in the end.

We have 50 years' worth of progress on top of grep, and grep is one of the worst ways to refactor a system.

Nice to see LLM companies are ignoring these teachings and speed running into disaster.


Only if they are not told how to search the codebase efficiently. All you need is an MCP server for code search. There are even LSP-backed MCP servers now.
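
For anyone curious what that looks like, here's a minimal sketch using the official Python MCP SDK's FastMCP helper; the ripgrep call is just a stand-in for whatever index or LSP client actually backs the search:

    import subprocess
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("code-search")

    @mcp.tool()
    def find_symbol(symbol: str) -> str:
        """Return lines that reference `symbol` as a whole word."""
        result = subprocess.run(
            ["rg", "--line-number", "--word-regexp", symbol],
            capture_output=True, text=True,
        )
        return result.stdout or "no matches"

    if __name__ == "__main__":
        mcp.run()

Point the agent at that one tool and it stops paying grep-the-whole-repo token costs for every lookup.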

I see. I'm highly skeptical of using these tools because I honestly feel faster with a vim + CLI workflow when I know what to write.

I'll have to check again, because 6 months ago this stuff was pure trash and more frustrating than useful (beyond a boilerplate generator that also boils the ocean).


Yes, check again - to be blunt, any opinions (at least tactical ones about how well feature X works) formed 6 months ago are not really relevant to the conversation today, given how fast this is all moving.

Opus 4.5 in Claude Code is a massive jump over 4.0 which is a massive jump over 3.7.

Each generation is being fine-tuned on a huge corpus of freshly-generated trajectories from the previous generation so things like tool use improve really quickly.


> grep is one of the worst ways to refactor

Hm? Care to explain this?

Using grep or regex is textual refactoring. If you want to rename every reference to a type Foo, how do you do that without touching any variables named foo, or any classes named FooBar?

The answer is to use tools that have semantic info to rename things.
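
A tiny concrete illustration of the collision (the file and names are made up):

    # widgets.py -- naive textual replace goes wrong here
    class Foo: ...        # the type we actually want to rename
    class FooBar: ...     # unrelated class that merely shares the prefix
    foo = FooBar()        # lowercase variable, also not our target

    # A blanket `sed -i 's/Foo/Baz/g' widgets.py` turns FooBar into BazBar;
    # word boundaries (\bFoo\b) spare FooBar but still can't tell this Foo
    # apart from a different Foo defined in another module. A rename driven
    # by an LSP server (or rope) resolves actual references instead of text.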


I often want them to rename all the textual references too, because otherwise you have a bunch of variables still using the old name.

Even though it has no semantic significance to the compiler, it does for all the human beings who will read it and get confused.


Another poster mentioned using symbols and references; another way to refactor code programmatically is to use codemods. Codemods are very powerful, and this is a use case where I find LLMs shine, since the various syntaxes and language ASTs are hard to remember (even if you understand what you're doing).
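
As a sketch of what a codemod looks like in Python (libcst here, with a made-up Foo -> Baz rename; a real one would add scope checks):

    import libcst as cst

    class RenameFoo(cst.CSTTransformer):
        """Rewrite every identifier literally named `Foo` to `Baz`."""

        def leave_Name(self, original_node: cst.Name, updated_node: cst.Name) -> cst.Name:
            if original_node.value == "Foo":
                return updated_node.with_changes(value="Baz")
            return updated_node

    source = "class Foo:\n    pass\n\nobj = Foo()\n"
    module = cst.parse_module(source)
    print(module.visit(RenameFoo()).code)

Codemods like this are purely syntactic: FooBar is untouched (it's a different identifier), but a Foo imported from somewhere else would get renamed too, which is where the semantic (LSP/rope) approach the other poster mentioned still wins.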

Are you having a positive experience with Codex compared to Claude Code? Codex in my brief experience was... not good w/ 5.1

Just to provide another datapoint - tried codex September / October after seeing the glowing reviews here, and it was, all in all, a huge letdown.

It seemed to be very efficient context-wise, but at the same time it made precise context management much harder.

Opus 4.5 is quite a magnificent improvement over Sonnet 4.5, in CC, though.

Re TFA - I accidentally discovered the new LSP support 2 days ago on a side project in Rust, and it’s working very well.


I was lukewarm about codex when I tried it 2-3 months ago, but I tried it again last week, running it against claude code, both working from the same todo list to build a docusign-like web service. I was using loops of "Look at the todo list and implement the next set of tasks" for the prompt (my prompt was ~3 sentences, but basically saying that):

    - Codex required around 30 passes on that loop, Claude did it in ~5-7.
    - I thought Codex's was "prettier", but both were functional.
    - I dug into Claude's result in more depth, and had to fix ~5-10 things.
    - Codex I didn't dig into testing quite as deeply, but it seemed to need less fixing.  Still not sure if that is because of a more superficial view.
    - Still a work in progress, have not completed a full document signing workflow in either.

Similar experience and timeline with codex, but I tried it last week and it's gotten much better in the interim. Codex with 5.2 does a good job of catching (numerical) bugs that Opus misses. I've been comparing them and there's no clear winner; GPT 5.2 misses things Opus finds and vice versa. But claude-code is still a much better experience and keeps getting better, with codex following just a few months behind.

Another anecdote/datapoint. Same experience. It seems to mask a lot of bad model issues by not talking much and overthinking stuff. The experience turns sour the more one works with it.

And yes +1 for opus. Anthropic delivered a winner after fucking up the previous opus 4.1 release.


What are some of the use cases for Claude Code + LSP ? What does LSP support let you do, or do better, that Claude Code couldn't do by itself ?

I checked the codex source code a few months ago and the implementation was very basic compared to opencode

It goes like this:

Codex is an outsourcing company, you give specs, they give you results. No communication in between. It's very good at larger analysis tasks (code coverage, health etc). Whatever it does, it does it sloooowwwllyyy.

Claude is like a pair programmer, you can follow what it's doing, interrupt and redirect it if it starts going off track. It's very much geared towards "get it done" rather than maximum code quality.


I’m basically only using the Codex CLI now. I switched around the GPT-5 timeframe because it was reliably solving some gnarly OpenTelemetry problems that Claude Code kept getting stuck on.

They feel like different coworker archetypes. Codex often does better end-to-end (plan + code in one pass). Claude Code can be less consistent on the planning step, but once you give it a solid plan it’s stellar at implementation.

I probably do better with Codex mostly due to familiarity; I’ve learned how it “thinks” and how to prompt it effectively. Opus 4.5 felt awkward for me for the same reason: I’m used to the GPT-5.x / Codex interaction style. Co-workers are the inverse, they adore Opus 4.5 and feel Codex is weird.


I've found it works wonderfully with 5.2. I think ChatGPT Plus is at the top of the weekly AI rolling wars. Most bang for the buck.

Interesting to see that you work at OpenAI but had to build a skill like this yourself.

Surprised that you don't have internal tools or skills that could do this already!

Shows how much more work there is still to be done in this space.


My theory is that even if the models are frozen here, we'll still spend a decade building out all the tooling, connections, skills, etc and getting it into each industry. There's so much _around_ the models that we're still working on too.

Agree completely. It's already been like this for 1-2 years even. Things are finally starting to get baked in, but it's still early. For example, AI summaries of product reviews, Gemini YouTube video summaries, etc.

It's hard to quantify what sort of value those examples generate (YouTube and Amazon were already massively popular). Personally I find them very useful, but it's still hard to quantify. It's not exactly automating a whole class of jobs, although there are several YouTube transcription services that this may make obsolete.


> Shows how much more work there is still to be done in this space.

This is why I roll my eyes every time I read doomer content that mentions an AI bubble followed by an AI winter. Even if (and objectively there's 0 chance of this happening anytime soon) everyone stops developing models tomorrow, we'll still have 5+ years of finding out how to extract every bit of value from the current models.


One thing, though: if the slowdown is too abrupt, it might make it financially impossible for OpenAI, Anthropic, etc. to keep running datacenters for us to use.

The idea that this technology isn't useful is as ignorant as thinking that there is no "AI" bubble.

Of course there is a bubble. We can see it whenever these companies tell us this tech is going to cure diseases, end world hunger, and bring global prosperity; whenever they tell us it's "thinking", can "learn skills", or is "intelligent", for that matter. Companies will absolutely devalue and the market will crash when the public stops buying the snake oil they're being sold.

But at the same time, a probabilistic pattern recognition and generation model can indeed be very useful in many industries. Many of our problems can be approached by framing them in terms of statistics, and throwing data and compute at them.

So now that we've established that, and we're reaching diminishing returns of scaling up, the only logical path forward is to do some classical engineering work, which has been neglected for the past 5+ years. This is why we're seeing the bulk of gains from things like MCP and, now, "agents".


> This is why we're seeing the bulk of gains from things like MCP and, now, "agents".

This is objectively not true. The models have improved a ton (with data from "tools" and "agentic loops", but it's still the models that become more capable).

Check out [1], a 100 LoC "LLM in a loop with just terminal access"; it is now above last year's heavily harnessed SotA.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

[1] - https://github.com/SWE-agent/mini-swe-agent
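
For a sense of how little harness that is, the core loop is conceptually something like this (a hedged sketch, not mini-swe-agent's actual code; call_model stands in for a chat-completion API call):

    import subprocess

    def call_model(messages: list[dict]) -> str:
        """Placeholder for an actual chat-completion API call."""
        raise NotImplementedError

    def run_agent(task: str, max_steps: int = 30) -> None:
        messages = [
            {"role": "system", "content": "Work on the task. Reply with exactly one "
                                          "shell command per turn, or the word DONE."},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            reply = call_model(messages).strip()
            messages.append({"role": "assistant", "content": reply})
            if reply == "DONE":
                break
            # Execute the proposed command and feed the observation back to the model.
            proc = subprocess.run(reply, shell=True, capture_output=True, text=True)
            messages.append({"role": "user", "content": proc.stdout + proc.stderr})

Essentially all the capability lives in the model; the harness just shuttles text between it and a shell.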


I don't understand. You're highlighting a project that implements an "agent" as a counterargument to my claim that the bulk of improvements are from "agents"?

Sure, the models themselves have improved, but not by the same margins from a couple of years ago. E.g. the jump from GPT-3 to GPT-4 was far greater than the jump from GPT-4 to GPT-5. Currently we're seeing moderate improvements between each release, with "agents" taking up center stage. Only corporations like Google are still able to squeeze value out of hyperscale, while everyone else is more focused on engineering.


They're pointing out that the "agent" is just 100 lines of code with a single tool. That means the model itself has improved, since such a bare bones agent is little more than invoking the model in a loop.

That doesn't make sense, considering that the idea of an "agentic workflow" is essentially to invoke the model in a loop. It could probably be done in much less than 100 lines.

This doesn't refute the fact that this simple idea can be very useful. Especially since the utility doesn't come from invoking the model in a loop, but from integrating it with external tools and APIs, all of which requires much more code.

We've known for a long time that feeding the model with high quality contextual data can improve its performance. This is essentially what "reasoning" is. So it's no surprise that doing that repeatedly from external and accurate sources would do the same thing.

In order to back up GP's claim, they should compare models from a few years ago with modern non-reasoning models in a non-agentic workflow. Which, again, I'm not saying they haven't improved, but that the improvements have been much more marginal than before. It's surprising how many discussions derail because the person chose to argue against a point that wasn't being made.


The original point was that the previous SotA was a "heavily harnessed" agent, which I took to mean it had more tools at its disposal and perhaps some code to manage context and so on. The fact that the model can do it now in just 100 LoC and a terminal tool implied the model itself has improved. It's gotten better at standard terminal commands at least, and possibly bigger context window or more effectively using the data in its context window.

Those are improvements to the model, albeit in service of agentic workflows. I consider that distinct from improvements to agents themselves which are things like MCP, context management, etc.


I think the point here is that it’s not about adding agents on top; the improvements in the models are what allow the agentic flow.

But that’s not true, and the linked agentic design is not a counterargument to the poster above. The LLM is a small part of the agentic system.

LLMs have absolutely got better at longer horizon tasks.

Useful technology can still create a bubble. The internet is useful but the dotcom bubble still occurred. There’s expectations around how much the invested capital will see a return and growing opportunity cost if it doesn’t, and that’s what creates concerns about a bubble. If a bubble bursts, the capital will go elsewhere, and then you’ll have an “AI winter” once again

Cobbler’s children…

I've had a number of occasions where claude (et al.) has incorrectly carried out a task involving existing code (e.g. create a widget for foo, following bar's example). In these cases the way I would have done it would be to copy said existing code and then modify the copy. I've always wondered if they should just be using a copy tool (even just xclip) instead of reproducing the code from context.

Brian on the OpenAI API team here. I would love to help you get to the bottom of the structured outputs issues you're seeing. Mind sending me some more details about your schema / prompt, or any request IDs you might have, to by[at]openai.com?


Thanks so much for reaching out, sent an email :).


If you liked this blog post, I can’t recommend PyMOTW[0] highly enough. It’s my go-to for a concise introduction whenever I need to pick up a new Python stdlib module.

[0]: https://pymotw.com/3/


This visualization is super cool! Thanks for sharing.


(2012)



Added. Thanks!


Assuming you have experience in software, then https://www.llnl.gov/join-our-team/careers/find-your-job/0d6...


BTW, they don't seem to have software roles at NIF: https://www.llnl.gov/join-our-team/careers/find-your-job/liv...


Aren't they required to post salary ranges by some Californian law?


These are federal jobs so all of the pay bands are public knowledge.


This is not quite correct. LLNL is a Federally Funded Research & Development Center (FFRDC) which is owned, as a facility, by the government, but managed and staffed by a non-profit contracting organization called Lawrence Livermore National Security, LLC (LLNS) under a contract funded by DOE/NNSA. The board of LLNS is made up of representatives from universities (California + TAMU), other scientific non-profits (Battelle Memorial Institute), and private nuclear ventures (e.g. Bechtel.) LLNS pays, with very few exceptions, staff salaries at LLNL, and they are not beholden to the government civilian pay schedule.

https://www.llnl.gov/about/management-sponsors


Law goes into effect next year.


Is it enforceable on the Feds?


What an amazing achievement. I was curious, so I looked up the open software roles at LLNL[0]. I'm very curious how the salary compares to your average bay area tech salary.

[0]: https://www.llnl.gov/join-our-team/careers/find-your-job/0d6...


See for yourself https://www.glassdoor.com/Salary/Lawrence-Livermore-National...

145k TC for Software Engineer. Better than I'd expect from a government job, although a qualified applicant could obviously make much more elsewhere. And supporting the needs of a bunch of PhDs doesn't really sound fun.


Fascinating article, thank you for sharing.

I found it incredible how several of the siblings are now well-known artists[0]! What a life these people have led.

[0]: https://en.wikipedia.org/wiki/Warlimpirrnga_Tjapaltjarri


> the 20-pound heart seems to be in good shape for its age. Still, there’s something about it—its pale color and gigantic proportions—that make it seem unreal. According to Da Silva, this dilation was probably caused by tuberculosis, which can cause the swelling of some organs.

Apparently a normal human heart weighs about 10 ounces[0]. Since 20 pounds is 320 ounces, that means Dom Pedro's heart is 32x the weight of a normal heart! Can all that extra weight truly be caused by tuberculosis alone?

[0]: https://my.clevelandclinic.org/health/body/21704-heart#:~:te....


I guess they included the formaldehyde weight in the total.


You're right about that -- Stripe led Paystack's series A and has also led other series A rounds like Fast's.

