Hey there, article author here! I spent a lot of time writing up a comment that talks through my process so I'll just lazy link it here if that's ok. [^1]
But the truth is I do build real production features all the time! That just wasn't the focus of this article. :)
Hey there, post author here. I do actually generate days' worth of code in minutes and weeks' worth of code in hours (not minutes) — but I didn't really cover those workflows because this post was more conceptual than focused on a specific tactic or technique, as other posts are.
But in case you're interested, a talk I gave in Spain this year just went live yesterday; it discusses not only real-world use cases but also a lot of the fundamentals of how these AI systems work to make that possible.
Hey Jose, author here! That's a great call out. I write predominantly in Swift and for a long time Claude was the only usable option. But sometime around GPT-5 OpenAI's models got much better at Swift, so the choice started becoming more about aesthetics (as a descriptor of preferences). So you're right — if the model can't write coherent code then it doesn't matter what kind of flow you feel as you're working with the tools — but I do imagine this will continue to improve for all languages including Elixir.
If it's okay to mention my own project, I'd appreciate it if you could check out https://charleswiltgen.github.io/Axiom/ (open source) and let me know what you think. It's focused on modern Swift, with specialized skills for helping developers get to strict Swift 6 concurrency.
I've only skimmed it since I'm between Christmas and a longer vacation that starts in 24 hours, but this actually looks really neat! I'll definitely take a closer look at these skills in depth — but this is exactly the kind of thing I've been telling people to take the time to invest in for their agentic environments. :)
Heya, author of the post here. That's a good call out because it's probably a lot!
And now that you mention it, that's also one failure case for why some people look at AI and go "this just isn't very good at coding". I'm not saying it has to be that way, nor will it be that way forever, but there are absolutely a lot of people who just download Claude Code or Cursor or Codex and dive right in without any additional setup.
That's partially why I suggest people use Codex for the workshops I offer: it provides the best results with no setup. All of these tools have a nearly unending amount of progressive disclosure because there's so much invisible configuration, and best practices are changing so fast. I'm still trying not to imply that one tool is "better" than another (even if I have my preference), but more to hit on the fact that which AI tools people like is mostly about their preferred set of tradeoffs.
What if I said "I don't use any of those and I think AI is very good at coding".
I'd be more interested in an article from you on how to go from "I use Claude Code out of the box" as a baseline and then how each extra layer (CLAUDE.md, skills, agents, MCP, etc.) improve its effectiveness.
That does sound like a good article! Sadly I've never written it and probably won't because I think a lot of that stuff doesn't provide as much value as people assume — which is why my personal conclusion in the post is to just lean into Codex.
This isn't a value judgment, it's just a question of where my priorities and tradeoffs lie. That said, I think Skills are the killer feature because they are a very composable tool — which I'll get to in a bit.
- Your CLAUDE.md should be a good high-level description with relevant details that you add to over time. Think of it as describing the lay of the land the way you would to a new coworker, including the little warts they need to know about before they waste hours on some known but confusing behavior.
- MCP has its purposes, but it's not really a great tool for software development. It's best suited for interfacing with a remote service (because it provides a discovery layer for LLMs on top of an API), but if you use MCP servers the way developers are told to, you're almost always better off using an equivalent CLI.
- I'll skip over agents, because an agent is basically a skill + a separate context window, and the main selling point is the context window bit. I think over time we'll see a separation of concerns where you can just spawn a skill with a context window and everyone will forget about the idea of agents in your codebase.
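To make the CLI-over-MCP point concrete, here's a sketch of the pattern using GitHub as the remote service. It assumes the `gh` CLI is installed and authenticated, and the PR/issue numbers are made up — it's an illustration of the tradeoff, not a prescription:

```sh
# Instead of wiring up a GitHub MCP server, let the agent shell out to the
# CLI it already knows how to drive — same data, no standing configuration:
gh pr list --state open --json number,title   # enumerate open pull requests
gh pr diff 128                                # read the diff for a specific PR
gh issue view 42 --comments                   # pull an issue with its discussion
```

The CLI still gives the model a discovery layer (`gh --help`), but without a long-lived server filling the context window with tool definitions.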
So now Skills. I wrote a well-received post [^1] a few months ago about Claude Skills, and why I think they are probably the most important of these tools. A skill is basically a plain-text description of an application.
The app can be something like I describe where Claude Code converts a YouTube video to an mp3 based on natural language, or you can have a code review skill, a linter skill, a security reviewer skill, and so on. This is what I meant when I said skills are composable.
You can imagine a team having lots of skills in their repo: one that guides an agentic system to build iOS projects well (steering it away from an LLM's bad defaults when building with Xcode), ones that capture context that's very relevant to the team, or even ones that enforce copy in your app conforming to your marketing team's language.
Skills are just markdown so they're very portable — and now available in Codex and many other places.[^2] (I had been using OpenSkills to great effect since the way Skills work is just through prompts). I now have a bunch of skills that do lots of things, for coding, marketing, data analysis, fact checking, copy-editing, and more. As a nice benefit they run in Claude — not just Claude Code. If you have ideas for processes you need to improve, I would invest my time and energy into building up Skills more than anything else.
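Since skills are just markdown, a minimal one is easy to picture. Here's a sketch, assuming the SKILL.md convention of YAML frontmatter (a name and a description) followed by plain-prose instructions — the skill itself is invented for illustration:

```markdown
---
name: copy-editor
description: Write and review in-app copy so it matches our style guide and voice
---

# Copy editing

When asked to write or review user-facing copy:

1. Read style-guide.md in this skill's folder before drafting anything.
2. Prefer short, active sentences; flag any jargon the style guide prohibits.
3. Return the final copy plus a one-line note for each guideline you applied.
```

Only the frontmatter is loaded up front; the body gets read when the skill becomes relevant, which is part of what keeps a repo full of skills cheap in terms of context.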
What % of effectiveness would you say is gained from these because... I am a pretty regular user of Claude Code in VS Code with no special goodies and I routinely hit "compacting" after 5-6 prompts in a single session. Then I need to validate it didn't slop all over the place. I can't imagine having 3-4 agents in the background (extra things to check) being a net-positive.
Skills and slash commands for sure but... I don't see them as necessary? "Review this code for ____ XYZ" as a skill. To me it's just something you could do as a prompt in your session?
1. I find Claude Code's handling of the context window to be pretty poor, and one of the reasons why I use it for smaller things versus multi-hour coding sessions. I'm not sure what dark magic OpenAI has done to make their context window feel infinite, but Codex has become a better choice for that at the moment.
2. A small note on subagents but Claude Code did this right. Subagents are granted their own context window, so they don't spill over into your context window until they're done doing their own work — and the added context is relatively minimal. I'd love to see OpenAI adopt this pattern as well, especially in combination with something like Skills rather than leaning into MCP.
3. When I suggested adding skills, I meant ones that are far more complicated than your example, ones that can drive a chunk of work autonomously. The skill I use for writing in-app copy (which I'm bad at because, as you can see, I'm never short for words) is about 100 lines long. It includes my style guide as an accessible resource, and a mostly complete history of my Bluesky posts to help achieve the authentic tone I use when discussing Plinky. (I write all of my posts, so this really is my voice.)
These kinds of skills save me a lot of time as an indie developer! As I mentioned I have ones for data insights, fact-checking, and of course for code. My main suggestion would be to think through every step of your work and see if they can be automated, and then turn small pieces of that into skills.
---
It's hard to assign a specific percentage to how much my effectiveness has improved, but it's a lot. The reason I don't want to put a number on it is that what I've gotten is a far broader set of skills (no pun intended) that allows me to execute in parallel. The metaphor I'd use to describe all this is to say that I'm no longer single-threaded.
I am a big believer that right now models work best for people who are effectively running small businesses — or teams that operate lean. The work of 10 can be done by 4-5 motivated and well-armed people, or an indie like me can do every facet of the work involved and do it well. I sit down and focus on explaining the big picture in great detail, and then set things off so I can tackle each part of the work in a round-robin style.
While an engineering task is going I'm off writing my newsletter with my words but with a skill that does meaningful research for me. While I'm running some research I'm in Figma working on social media assets. While I'm doing code review for my app's code I've got the server side building in the background.
Last week I had Codex finding a domain for me, with specific requirements. (Here's a simplified version of the prompt.)
> I need a domain to represent this concept [+ 200 words], based on the code in this repository. [Code included so Codex really knows what the heck I'm building and talking about.] Don't show me any domains over $50/year at this registrar. Make sure it's a real word with no fun typos (like tumblr.com being short for tumbler), and no compound words like "thisisfun.com". You can start with this list of TLDs, but if you think there are any others that could be a good match then you can make a suggestion.
And after about 10 messages back and forth Codex found something that would have taken me far longer to research on my own — in parallel.
This all means that I'm able to write code, do marketing, design, support (which is always me and not AI), and run my business. If I plan well what I get is an extra set of hands to hand things off to, and most of the time (honestly) it does the work perfectly. But even for the times it doesn't, if it gets me 80-90% of the way there, that's a huge head start over where I would have been previously.
So the reason I'm hesitant to answer this with a specific percentage is that your experience across organizations will vary. But what I've seen in my work (solo engineering, teaching, and consulting) is that the gains are substantial. That's true even for roles where you're singularly focused on writing code — but the key is to lean into the strengths of this system and be creative about how you use it.
As I said — incapable of keeping my writing short so I hope that helps!
> That's partially why I suggest people use Codex for the workshops I offer, because it provides the best results with no set up.
I would do the exact opposite… If we are pitching “this shit works magically without any setup” people will expect magic and they absolutely will not get magic as there is no magic. I believe, especially if we are educators (you obviously are!!) that it is our responsibility to teach it “right” - my workshop would probably spend at least 75% of the time on the setup
So I definitely understand where you're coming from, but let me provide a little bit of context.
The workshops are 3-4 hours and we do spend a lot of time discussing how things work in reality vs. how they work in the context of the workshop. It's worth noting that these workshops run the gamut from non-technical people in sales to seasoned developers, so a lot of people simply won't learn much (or have the excitement to learn on their own) if we spend the first 2-3 hours setting things up.
In my experience the heaviest lift for teaching practically any technical subject is getting someone interested by showing them how to accomplish something they care about, and then leaving them with lots of information and resources so they can continue experimenting and growing even once we're done. The way I do that is to make sure they leave the workshop having built their own idea — without taking shortcuts!
Being able to use Codex to accomplish something because you spent an hour crafting a good prompt isn't cheating, it's learning the skill of becoming a better technical communicator — in a short period of time — and taking advantage of the skill you've just learned. I don't consider that magic; it's actually the core tenet of building with AI, and is very much how I work with AI every day.
I'm late for dinner so I should probably stop here, so I'll leave just one final note. After every workshop I send each student a list of personalized resources that will help them continue on their journey by demystifying things that we may have glossed over or weren't clear in the workshop — so they should be armed with the tools to take their next steps away from any magical thinking.
It's a bit hard to boil down exactly what I do and how I try to design for best hands-on pedagogical practices in an HN post I'm writing on the go — but I am absolutely open to your thoughts! :)
Heya, author here! I do agree with you that this is a big downside, but I don’t know if this is the primary reason.
In my experience teaching people, most people don’t actually know much at the time they make this decision. They’ve heard about Cursor, they’ve heard of Claude Code, and they may have heard about Codex. But what they’ve heard is anecdotes and marketing — they don’t yet have hands-on experience.
They make a big choice and then assume that this is how all AI works, because they don’t have a full breadth of context yet. And that’s to be expected! That’s how most things work.
That is why I teach the workshops I do to make AI accessible, so people can walk through the tradeoffs and make the most educated choices for themselves.
A couple of comments here have said that the post is subtly pro-Codex, but I tried to make my point very explicit: people should try a lot of things and see what works best for them. But it’s very hard to do that without investing a lot of time because the market is so nascent and moving so fast. This post exists to try and nudge people into exploring more of the tools they haven’t tried yet, so they can make their own informed decisions like you have. :)
All that’s to say, people definitely hit limits with Claude Code (as I have done myself) — especially if they’re hesitant to upgrade to Claude Max because they haven’t gotten enough out of Claude Pro. But I think the real reason people make the choices they do starts earlier in the process, even before they get a lot of hands on experience with Claude Code or Codex.
Heya, I’m the author of the post! To be clear I have AI write probably 95% of my code these days, but I review every line of code that AI writes to make sure it meets my high standards. The same rules I’ve always had still apply — to quote @simonw “your job is to deliver code you have proven to work”.
So while I’m enthusiastic about AI writing my code in the literal sense, it’s still my code to understand and maintain. If I can’t do that then I work with AI to understand what was written — and if I still can’t, I’ll often give it another go with another approach altogether so I can generate something I can understand. (Most of the time working together to understand the code works better, because I love to learn and am always open to pushing my boundaries to grow — and this process can be tuned well for self-directed learning.)
And to quote a recent audit: “this is probably one of the cleanest codebases I’ve ever audited.” I say that to emphasize that I care a lot about the code that goes into my codebase, and I’m not interested in building layers of unchecked AI slop for code that goes into my apps.
Personally it would be too difficult for me to understand large chunks of work, like in your case "a week's worth of code" at a time. Just wondering, how do you go about it?
Second, how do you pass such large PRs to your co-workers (if you have any)?
So I will state upfront that my current experience is not the most common team dynamic because I'm an indie developer [^1]. But I've worked at many companies — as small as 2 and as large as Twitter — so I am very familiar with the variety of engineering processes.
I can share how I work with agentic systems, because I (and now others) have found it to be very effective. I still have the engineering-like experience of thinking deeply — I've gotten great results across codebases small and large — and almost everyone I've run a workshop with has come back and said that this was a missing piece for them when working with agentic systems.
I'm the kind of person I alluded to at the end of my blog post when I wrote "Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem.", so this description will be representative of that.
1. I start a document in Craft [^2] whenever I think of a great feature, and keep adding to that doc over the next few months whenever I have a new idea. I try to turn that document into something cohesive — imagine something like a PRD without the formality.
2. Then when it comes time to build the feature, I will just sit and write out a prompt (with lots of pointers to source code and relevant screenshots) that considers everything that needs to be built. I'll write out our goals for the feature, how the client should work, how the server should behave, the expected user experience, and anything else that's relevant. That process is really clarifying because it unearths a whole bunch of meaningful context — and context is exactly what a large language model needs!
3. Last but not least I'll simply add something like "Please ask any clarifying questions you may have, or for any additional details that you may find helpful". That leads to questions which I spend anywhere from another 5 to 30 minutes on, and which fill in gaps I hadn't even thought to consider. Sure, that may take time, but now the model has *so many useful details* that most people never add to their context window.
4. Once you have that, the model can act much more surgically than the experience most people have with agentic systems. Since it's so surgical I can go do something else like work on my newsletter, my AI workshops, or even go for a walk. This is why I much prefer to work this way, as opposed to the hands-on process I described Claude Code users [often] preferring in the blog post. (Which as I mentioned there is perfectly fine, just not my cup of tea anymore.)
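To make steps 2 and 3 concrete, a stripped-down prompt following that shape might look like this (the feature and file names here are invented for illustration, not taken from a real project):

```markdown
## Goal
Let users pin links to the top of their list, synced across devices.

## Client
- Add a pin action to the link row view (Sources/Views/LinkRowView.swift).
- Pinned links sort above everything else; persist the order locally first.

## Server
- Add a `pinned` flag to the Link record and sync it through the existing
  update endpoint.

## User experience
- Pinning is instant and optimistic; a failed sync retries quietly.

## Out of scope
- No UI for manually reordering pins yet.

Please ask any clarifying questions you may have, or for any additional
details that you may find helpful.
```

A real prompt would be much longer and point at far more source code, but even a skeleton like this forces you to surface the context the model actually needs.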
---
I'd still like to touch on working with people though. I do quite a bit of open source work and there I still follow what people would consider standard processes and best practices. If I'm doing a week's worth of work I still don't want to dump a whole ton of code in one commit, so I'll break everything down into very atomic commits that spell out exactly what I'm doing. I also write lots of documentation, update references, and add tests like a person should.
But there's also nothing to say you have to generate a week's worth of code in one go. It's important to remember that you're in control of how you work. It may be more fitting to define smaller tasks (which will take less time for each independent step) and work on them serially, which you can then hand off to your coworkers one by one.
Ultimately my message is that people still need to exercise their best judgment and think for themselves. AI doesn't change what we've come to accept as best practices, it automates and accelerates them. In fact, the models keep getting better the more they are trained on our best practices, so my assertion is that success using AI seems to correlate well with autonomy, creativity, and critical thinking skills.
Anyhow, long answer for a short question — but I hope it helps! And if there's anything unclear: please ask any clarifying questions you may have, or for any additional details that you may find helpful.
As the author of the post I think it was a nice quick post to share my perspective of a behavior I’ve been seeing across many (but not all) developers recently, but I’m always open to feedback for how to improve my writing!
And as I mentioned here (https://news.ycombinator.com/item?id=46392900) I have no affiliation with any of the organizations, nor care to evangelize any of them. Nobody pays me to write, I’m just a guy on the internet sharing his thoughts, building software, and teaching people how to use AI better with any tool people want to use. :)
Heya, author of the post here! I think you're right in everything you've said, but I want to note that the programming language comparison was meant to be metaphorical more than literal. Everything is changing so fast (as I mention in the post a few times), but I have seen some (far from all) people get locked into Claude Code or Codex in a way where they won't even consider alternatives — the same way people who chose Ruby to start their career now identify as Ruby developers.
My goal was to open people's minds just a little bit by saying exactly what you're getting at — everything is moving fast and we should be reassessing often. A meaningful difference is that you can start a codebase with Claude Code and then switch to Codex with almost no friction, while you can't just migrate a TypeScript app to Python in 15 minutes.
Heya, author here! I completely agree with you — which is why the post is titled Codex vs. Claude Code (Today). I also have this very specific disclaimer in the second paragraph to note that this post is a reflection of a moment in time. :D
> Before we continue, I need to make a disclaimer: This post is about Claude Code and Codex, on December 22, 2025. Everything in AI changes so fast that I have almost no expectations about the validity of these statements in a year, or probably even 3-6 months from now.
That said I do what you do and try different models when I want to see if things have changed. I run my own private little benchmarks with a few complex real world tasks, and I really love seeing how things are progressing — both in terms of quality but also the novel quirks that are introduced, changed, or removed. :)
I'll try to answer these one by one, but I will just note that a lot of my prompts are domain specific so it's hard to share those.
- I don't use any plans — my writing is the plan. The Plan Mode in Claude Code is excellent, but as I've switched to Codex (which doesn't have one) I will simply write up a nice long prompt and then add "Please ask any clarifying questions you may have, or for any additional details that you need" — and it works great! I may go back and forth for anywhere from 5-30 minutes depending on what else is needed, but that's basically the experience of using Plan Mode in Claude Code too.
- I've built quite a few recent features for my app Plinky [^1]. I've made a few meaningful contributions to my open source project Boutique [^2] (and have been having AI asynchronously sketch out a large new database relationships feature). I built my new blog and my workshops pages [^3] with Codex as well. Truth is I do practically everything in Codex and Claude Code these days, so I'd have more trouble listing what I haven't built lately.
- Plinky's upcoming Reader Mode is a good example of a prompt that took me two hours, but the feature isn't yet in the app so I'd prefer not to share the prompt. But I can share the first draft of the prompt for Boutique's relationships feature since that's open source. [^4] I've been experimenting with using ChatGPT Pulse to make progress on it every day (simply by asking it to!), and much to my surprise it's been designing a new API day by day in a way that's far from perfect but certainly very interesting.
The honest truth is that this one did not take two hours, and I wrote it on the bus, so it's probably not perfect — but the descriptive process is effectively the same. For a feature like Reader Mode you would have to capture more details to scale up to the additional complexity of a domain-specific feature with client and server components, a new download queueing pipeline, and other abstractions.
Thanks! It would be interesting to compare with a looser approach. I use shorter prompts that take me a few seconds to write, and it's also built some impressive stuff for me. I've mainly been using Claude Code and a bit of Gemini/Amp/Droid, so I'll add Codex to the mix this week. So far I don't really feel much of a difference between tools or models.
[^1]: https://news.ycombinator.com/item?id=46399123