
IMHO, jumping from Level 2 to Level 5 is a matter of:

- Better-structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality, and reasonable width. Think microservices.

- Better documentation - most code documentation is not built to handle updates. We need a proper graph structure with a few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.

At this point, I really don't think that we necessarily need better agents.

Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best according to some criteria), and your life will go smoothly.
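For concreteness, a rough sketch of that fan-out-and-pick loop (assuming the Codex CLI's non-interactive "codex exec" mode; the task prompt and the run_tests.sh selection criterion are placeholders):

    # Sketch: fan out 5 attempts in isolated git worktrees,
    # report the ones that pass the test suite.
    task="Implement the feature described in docs/issue.md"  # placeholder prompt
    for i in 1 2 3 4 5; do
      git worktree add "../attempt-$i" -b "attempt-$i"
      ( cd "../attempt-$i" && codex exec "$task" && ./run_tests.sh ) \
        && echo "attempt-$i passed" &
    done
    wait  # then diff the passing attempts and merge the best one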


> Think microservices.

Microservices should be a last resort, reached only when you've either: a) hit technical scale that necessitates them, or b) hit organizational complexity that necessitates them.

Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).

> Better documentation

If this means documenting the reasoning behind decisions, then yes. If this means explaining the code, then no: code is the best documentation. English is nowhere near as good at describing how to interface with computers.

Given how long gpt-5-codex has been out, there's no way you've followed these practices for long enough to consider them definitive (2 years at the least, likely much longer).


> Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely

Agreed, but how else are you going to scale mostly AI written code? Relying mostly on AI agents gives you that organizational complexity.

> Given how long gpt codex 5 has been out, there’s no way you’ve followed these practices for a reasonable enough time to consider them definitive

Yeah, fair. Codex has been out for less than 2 weeks at this point. I was relying on gpt-5 in August and opus before that.


I understand why you went with microservices; people make that choice even when not using LLMs, because it looks more organized.

But in my experience, a microservice architecture is orders of magnitude more complex to build and understand than a monolith.

If you struggle, even with the help of an LLM, to keep a monolith organized, I am positive you will find it even harder to build microservices.

Good luck in your journey, I hope you learn a ton!


Noted. Thanks!


Can you show something you have built with that workflow?


Not yet unfortunately, but I'm in the process of building one.

This was my journey: I vibe-coded an Electron app and ended up with a terrible monolithic architecture and mostly badly written code. Then I took the app's architecture docs and spent a lot of my time shouting "MAKE THIS ARCHITECTURE MORE ORTHOGONAL, SOLID, KISS, DRY" at gpt-5-pro, and ended up with a 1,500+ line monster of a doc.

I'm now turning this into a Tauri app and following the new architecture to a T. I would say it has a pretty clean structure with multiple microservices.

Now, new features are gated based on the architecture doc, so I'm always maintaining a single source of truth that serves as the main context for any new discussions/features. Also, each microservice has its own README file(s) which are updated with each code change.


I vibe-coded an invoice generator by first vibe-coding a "template" command-line tool: a bash script that substitutes {{words}} in a LibreOffice Writer document (those are just zipped XML files, so you can unpack them to a temp directory and substitute raw text without any XML awareness) and, at the end, calls LibreOffice's CLI to convert the result to PDF. I also asked the AI to generate a documentation text file so that the next AI conversation could use the command as a black box.
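For illustration, the whole template tool can be about this small (a sketch, not the actual script; the {{key}} scheme is from above, the CLI shape and file names are assumptions):

    #!/usr/bin/env bash
    # usage: template.sh in.odt out.pdf KEY=VALUE [KEY=VALUE ...]  (hypothetical CLI)
    set -euo pipefail
    src=$1; out=$2; shift 2
    tmp=$(mktemp -d)
    unzip -q "$src" -d "$tmp/doc"        # an .odt is just a zip of XML files
    for pair in "$@"; do
      key=${pair%%=*}; val=${pair#*=}
      # raw text substitution; assumes simple values without / or & characters
      sed -i "s/{{$key}}/$val/g" "$tmp/doc/content.xml"
    done
    # (strictly, the mimetype entry should be stored first, but LibreOffice is tolerant)
    (cd "$tmp/doc" && zip -q -rX "$tmp/filled.odt" .)
    libreoffice --headless --convert-to pdf --outdir "$tmp" "$tmp/filled.odt"
    mv "$tmp/filled.pdf" "$out"
    rm -rf "$tmp"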

The vibe-coded main invoice-generator script then does the calendar calculations to figure out the pay cycle and examines the existing invoices in the invoice directory to determine the next invoice number (the number is in the file name, so it doesn't need to open the files). When it's done with the calculations, it uses the template command to generate the final invoice.
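The "next invoice number" part only needs the file names (a sketch, assuming an invoice-NNNN.pdf naming scheme):

    # Highest existing number, plus one, zero-padded; 10# avoids octal surprises
    last=$(ls invoices/invoice-*.pdf 2>/dev/null |
           sed 's/.*invoice-\([0-9]*\)\.pdf/\1/' | sort -n | tail -1)
    next=$(printf 'invoice-%04d' $((10#${last:-0} + 1)))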

This is a very small example, but I do think that clearly defined modules/microservices/libraries are a good way to only put the relevant work context into the limited context window.

It also happens to be more human-friendly, I think?


I "vibe coded" a Gateway/Proxy server that did a lot of request enrichment and proprietary authz stuff that was previously in AWS services. The goal was to save money by having a couple high-performance servers instead of relying on cloud-native stuff.

I put "vibe coded" is in quotes because the code was heavily reviewed after the process, I helped when the agent got stuck (I know pedants will complain but ), and this was definitely not my first rodeo in this domain and I just wanted to see how far an agent could go.

In the end it needed a few modifications before going into prod, but to be fair, it was actually fine!

One thing I vibe coded 100%, barely looking at the code until the end, was a macOS menu bar app that shows some company stats. I wanted it in Swift but WITHOUT Xcode, and the agent was super helpful in that regard.


Of course not.


I've been using Claude on two codebases, one with good layering and clean examples, the other not so much. I get better output from the LLM with good context, clean examples, and documentation. It's not surprising that clarity in code benefits both humans and machines.


I think there will be a couple of benefits to using agents soon. They should produce a more consistent codebase, which will make patterns easier to see and work with, and mean less reinventing the wheel. Migrations should also be way faster, both within and across teams, so a lot less struggling to maintain two ways of doing something for years, which again leads to simpler and more consistent code. Finally, the increased speed should lead to more serializability of feature additions, so fewer problems trying to coordinate changes happening in parallel: conflicts, redundancies, etc.

I imagine over time we'll restructure the way we work to take advantage of these opportunities and get a self-reinforcing productivity boost that makes things much simpler, though agents aren't quite capable enough for that breakthrough yet.


I'm Schmidhuber-neutral, but word on the street is that he's a major asshole and sometimes impossible to work with. His research might be more solid than the Turing Award winners', but his personality is what truly kept him behind.


I have been thinking about the exact same problem for a while and was literally hours away from publishing a blog post on the subject.

+100 on the footnote:

> agents or workflows?

Workflows. Workflows, all the way.

The agents can start using these workflows once they are actually ready to execute things with high precision. And by then we will have figured out how to create effective, accurate, and easily diagnosable workflows, so people will stop complaining that they "want to know what's going on inside the black box".
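To make the distinction concrete: a workflow is a fixed pipeline where the model fills in one narrow, well-defined step, rather than a loop where the model decides what happens next. A toy sketch (using Simon Willison's llm CLI; the handler scripts are hypothetical):

    # Workflow: deterministic steps, one LLM call with a narrow job
    category=$(cat support_ticket.txt |
      llm "Classify this ticket as BILLING, BUG, or OTHER. Reply with one word." |
      tr -d '[:space:]')
    # Everything downstream branches on a plain string, not on model judgment
    case "$category" in
      BILLING) ./route_to_billing.sh ;;   # hypothetical handlers
      BUG)     ./file_bug.sh ;;
      *)       ./route_to_human.sh ;;
    esac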


I've been building workflows with "AI" capability inserted where appropriate since 2016. Mostly customer service chatbots.

99.9% of real-world enterprise AI use cases today are workflows, not agents.

However, "agents" are being pushed because the industry needs a next big thing to keep the investment funding flowing in.

The problem is that even the best reasoning models available today don't have the actual reasoning and planning capability needed to build truly autonomous agents. They might in a year. Or they might not.


Agreed; I started crafting workflows last week. I'm still unimpressed by how poorly the current crop of models follows instructions.

And are there any guidelines on how to manage workflows for a project or set of projects? I’m just keeping them in plain text and including them in conversations ad hoc.


I'm sold. How do I play?


Wait 3 months for me to finish.

So far it's basically just a Django server. You're responsible for self-hosting (although I imagine I'll put up a demo server), and you can define your own cards.

You can play the game by literally just using curl calls if you want to be hardcore about it.
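To illustrate the curl-only flow (these routes are made up; the real endpoints will be whatever the server ends up exposing):

    curl -s -X POST http://localhost:8000/api/games/ \
         -H 'Content-Type: application/json' -d '{"deck": "starter"}'
    curl -s -X POST http://localhost:8000/api/games/1/play/ \
         -H 'Content-Type: application/json' -d '{"card": "goblin"}'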

I *might* make a nice closed-source game client one day, with a bunch of cool effects, but that's not my focus right now.


Also available on claude.ai


Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):

"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."


Good to know OpenAI knows the frustration of trying to argue with its own RL-based models as well.


Aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models to get excellent code output.
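Aider's architect mode exposes that split directly; from memory the invocation looks roughly like this (flag spellings and model names may have drifted, check aider --help):

    # Reasoning model plans the change, a coding model writes the edits
    aider --architect \
          --model deepseek/deepseek-reasoner \
          --editor-model claude-3-5-sonnet-20241022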

My experience with most of the models focused on reasoning improvements is that they tend to be a bit worse at following specific instructions. It is also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while their instruction-following scores drop.

I wonder why the two seem to sit on some sort of continuum.


Kind of like an AI "thinking, fast and slow".


Sort of? I don't see why thinking slow should inhibit the ability to follow instructions.


I think they're referencing "Thinking, Fast and Slow" - https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "


Yes, I understand the reference. I don't understand their argument that this is a good example of that common mental model for LLMs.

In this case "fast, instinctive, and emotional" models are better at instruction following than "slower, more deliberative, and more logical" models.


Nice thread on Giorgio Parisi by Montanari: https://twitter.com/Andrea__M/status/1445405295811960841?s=2...

