
At work I'll buy a Max subscription for anyone on my team who wants one. If it saves 1-2 hours a month it's worth it, and people get that much value even if they only use the LLMs to search the codebase (rough math below). And the frontier models are still noticeably better than the others.

At home I have a $20/month subscription and that's covered everything I need so far. If I wanted to do more at home, I'd seriously look into the open weight models.
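
The break-even math, with every dollar figure an assumption for illustration rather than a real price:

    # Break-even estimate for a team LLM seat. All numbers are assumptions.
    SEAT_COST_PER_MONTH = 200.0   # assumed max-tier seat price, USD
    LOADED_HOURLY_COST = 150.0    # assumed fully loaded engineer-hour, USD

    break_even_hours = SEAT_COST_PER_MONTH / LOADED_HOURLY_COST
    print(f"seat pays for itself at ~{break_even_hours:.1f} hours saved/month")

With those (made-up) inputs the seat pays for itself at roughly 1.3 hours saved a month, which is why 1-2 hours clears the bar easily.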


I've also seen Opus 4.6 as a pure upgrade. In particular, it's noticeably better at debugging complex issues and navigating our internal/custom framework.

Same here. 4.6 has been considerably more diligent for me.

Likewise, I feel like it's degraded a bit in performance over the last couple of weeks, but that's just vibes. They surely vary thinking tokens based on backend load, especially for subscription users.

When my subscription 4.6 is flagging, I'll switch over to the corporate API version, run the same prompts, and get a noticeably better solution. In the end it's hard to compare nondeterministic systems.


That's very interesting!

Also, +1. Opus 4.6 is strictly better than 4.5 for me


Yeah, this is my main way of using Claude Code for anything complex: a REPL or bash window in tmux, with Claude running commands there. That lets me easily browse through anything that's happened in a UI I'm used to, or manually intervene if needed.
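
If anyone wants to replicate the layout, here's a minimal sketch scripted from Python; the session name is arbitrary and the pane commands are just examples:

    # Set up a tmux session with Claude Code in one pane and a REPL in another.
    # Session name and pane commands are placeholders; adjust to taste.
    import subprocess

    def tmux(*args):
        subprocess.run(["tmux", *args], check=True)

    SESSION = "claude-dev"  # arbitrary name

    tmux("new-session", "-d", "-s", SESSION)   # detached session, window 0
    tmux("split-window", "-h", "-t", SESSION)  # side-by-side panes
    tmux("send-keys", "-t", f"{SESSION}:0.0", "claude", "Enter")   # Claude Code, pane 0
    tmux("send-keys", "-t", f"{SESSION}:0.1", "python3", "Enter")  # REPL, pane 1
    # Then: tmux attach -t claude-dev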

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

Models have improved dramatically even with the same harness


I mean that the way it tackles a task at its core is generated differently, like an inner harness, via the system prompt or something deeper. For example, instead of answering instantly, it goes through pre-defined steps: decide which strategy to use, split the task, spend thinking tokens, use tools, etc.
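
Purely to illustrate what I mean by pre-defined steps, a toy sketch; nothing here is any vendor's actual harness, and every method on 'model' is hypothetical:

    # Toy "inner harness": plan with thinking tokens first, then act with tools.
    # None of this reflects any real implementation.
    def run_task(task, model, tools):
        strategy = model.think(f"Pick a strategy for: {task}")          # hypothetical API
        subtasks = model.think(f"Split into steps, given: {strategy}")  # hypothetical API
        results = [model.act(step, tools=tools) for step in subtasks]
        return model.think(f"Combine results: {results}")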

Bezos's API memo is the biggest example I can think of. It was not individually productive for teams but arguably it was very productive for Amazon/AWS as a whole.

That's a top-down organizational change mandating certain capabilities for all Amazon software. Unlike these AI mandates, it's not dictating the exact tools developers use to write that software, but rather what the software itself should do. For reference, here's the API mandate [0]:

   1. All teams will henceforth expose their data and functionality through service interfaces.

   2. Teams must communicate with each other through these interfaces.

   3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

   4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter.

   5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

   6. Anyone who doesn’t do this will be fired.

This is very different from saying "any developer who doesn't use an IDE and a debugger will be fired," which is analogous to what the AI mandates are prescribing. (See the sketch of rule 3 below the link.)

[0] https://nordicapis.com/the-bezos-api-mandate-amazons-manifes...
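
To make rule 3 concrete, here's a sketch of forbidden vs. allowed access; the database path, table name, and service URL are all made up for the example:

    # Rule 3 illustrated: no direct reads of another team's data store.
    # Paths, table names, and URLs below are invented for the example.
    import json, sqlite3, urllib.request

    def get_order_status_forbidden(order_id):
        # Reading the orders team's database directly: exactly what's banned.
        con = sqlite3.connect("/shared/orders-team/orders.db")  # hypothetical path
        row = con.execute(
            "SELECT status FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        return row[0]

    def get_order_status_allowed(order_id):
        # Calling the owning team's service interface over the network instead.
        url = f"https://orders.internal.example.com/v1/orders/{order_id}"  # hypothetical
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["status"]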


Cursor performs notably worse for me on my medium-sized codebase (~500 kloc), possibly because it tries to aggressively conserve context. This is especially true for debugging: Claude Code will read dozens of files and do a surprisingly good job of finding complex bugs, while Cursor seems to just respond with the first hypothesis it comes up with.

That said, Cursor Composer is a lot faster and really nice for some tasks that don't require lots of context.


I do this. For example, the other day I made a commit where I renamed some fields of a struct and removed others, then realized it would be easier to review as two separate commits. But it was hard to split them out mechanically, so I asked Claude to do it, creating two new commits whose end result must match the old one and which must both pass tests. It works quite well.
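
The nice part is that the constraint is mechanically checkable afterwards. A sketch of the verification, with refs and the test command as placeholders:

    # Check that the two replacement commits reproduce the original tree
    # and that tests pass at each new commit. Refs/commands are placeholders.
    import subprocess

    def git(*args):
        subprocess.run(["git", *args], check=True)

    ORIGINAL = "old-tip"   # the single commit before the split
    NEW_TIP = "split-tip"  # tip after the two replacement commits

    git("diff", "--quiet", ORIGINAL, NEW_TIP)   # raises if the trees differ

    for ref in (f"{NEW_TIP}~1", NEW_TIP):       # intermediate commit, then tip
        git("checkout", ref)
        subprocess.run(["pytest"], check=True)  # stand-in for the real test runner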

I use one Claude instance at a time, roughly full-time (it writes ~90% of my code), generally making small changes, nothing weird. According to ccusage, I spend about $20 of tokens a day, a bit less than 1 MTOK of output tokens a day. So the exact same workflow would be about $120 for higher speed.
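
For anyone checking the arithmetic, the shape of the daily number (the per-MTOK rate is a placeholder, not a current list price):

    # Back-of-the-envelope daily token spend. Rate is a placeholder.
    OUT_TOKENS_PER_DAY = 0.9e6   # "a bit less than 1 MTOK output a day"
    BLENDED_USD_PER_MTOK = 22.0  # hypothetical blended rate, input cost folded in

    daily = (OUT_TOKENS_PER_DAY / 1e6) * BLENDED_USD_PER_MTOK
    print(f"~${daily:.0f}/day")  # ~$20/day, matching ccusage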

What made me feel old today: seeing a 36-year-old referred to as "an older type"
