Same, I would have given up on them long ago, and I no longer code at all now. Why would I, when the latest models can do it better, faster, and without human limitations like tiredness and emotional swings?
Perhaps because of the potentially slower actuation speed, but you also generally get a lot more power from hydraulics, so I'm not sure one can claim it is junk. Far less acrobatic, but also far more sumo wrestler.
I wash new clothes by themselves the first time for this reason; it's a good enough solution.
I once had jeans with buttons, and at first I thought it was terrible, but after 2-3 uses I got used to it, so it's not a big deal I guess.
And on quality: there is a YouTube channel where they cut shoes in half and rate the quality of the build, and based on the couple of shoes I've checked out, price and brand rarely correlate with good quality.
As with everything else, when buying expensive long-term items (like leather boots), it is worth doing some research into which option is best.
The other side of catching it going off the rails is when it wants to make edits without reading the context I know would've been necessary for a high-quality change.
yeah exactly - it's that confidence without understanding. like it'll make a change that looks reasonable in isolation but breaks an assumption that's only documented three files over, or relies on state that's set up elsewhere. and you can't always tell just from looking at the diff whether it actually understood the full picture or just got lucky. this is why seeing what files it's reading before it makes changes would be super helpful - at least you'd know if it missed something obvious
Totally agreed. Those assumptions often compound as well. So the AI makes one wrong decision early in the process and it affects N downstream assumptions. When it finally finishes, it has built the wrong thing. And this happens with just one process running. Even on the latest Opus models I have to babysit, correct, and redirect Claude Code constantly. There's zero chance that 5 Claude Codes running for hours without my input are going to build the thing I actually need.
And at the end of the day it's not the agents who are accountable for the code running in production. It's the human engineers.
Actually, it works the other way. With multiple agents they can often correct each other's mistaken assumptions. Part of the value of this approach is precisely that you do get better results with fewer hallucinated assumptions.
The corrective agent has the exact same percentage chance of making a mistake: "correcting" an assumption that was previously correct into an incorrect one.
If a single agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.
You are assuming statistical independence, which is explicitly not correct here. There is also an error in your analysis - what matters is whether they make the same wrong assumption. That is far less likely, and becomes exponentially unlikely with increasing trials.
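To put rough numbers on that last point, here is a back-of-envelope sketch with made-up figures, treating the agents as roughly independent (which real agents only approximate):

```python
# Hypothetical numbers: each agent independently has a 1% chance of landing
# on one particular wrong assumption. The question is how often several
# agents all land on the *same* wrong assumption.

p_wrong = 0.01

for n_agents in (1, 2, 3, 5, 10):
    # Probability that every one of the n agents makes that same wrong assumption.
    p_all_same_wrong = p_wrong ** n_agents
    print(f"{n_agents:2d} agents: P(all share the same wrong assumption) = {p_all_same_wrong:.0e}")
```

The chance that any one of them is wrong stays roughly constant, but the chance that they all agree on the same wrong thing falls off exponentially, which is what makes cross-checking useful.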
I can attest that it works well in practice, and my organization is already deploying this technique internally.
You can ask Opus 4.6 to do a task and leave it running for 30 minutes or more to attempt one-shotting it. Imagine doing this with three agents in parallel in three separate worktrees. Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one. If no consensus after N runs, reframe to provide directions for a 4th attempt. Continue until a clear winning approach is found.
This is one example of an orchestration workflow. There are others.
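For what it's worth, here is a rough sketch of that loop in Python, with hypothetical run_attempt and judge callables standing in for whatever agent CLI or API you actually orchestrate with (neither is a real Claude Code interface), and a simple consensus fraction as one possible stopping rule:

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def pick_winning_approach(
    run_attempt: Callable[[int], str],      # hypothetical: run one agent in its own worktree, return an approach/branch id
    judge: Callable[[Sequence[str]], str],  # hypothetical: fresh-context agent that returns the id it prefers
    n_attempts: int = 3,
    max_judge_runs: int = 5,
    consensus: float = 0.6,                 # fraction of the judging budget that must agree
) -> Optional[str]:
    """Sketch of 'parallel attempts, then repeated fresh-context judging'."""
    approaches = [run_attempt(i) for i in range(n_attempts)]

    votes: Counter[str] = Counter()
    for _ in range(max_judge_runs):
        votes[judge(approaches)] += 1
        winner, count = votes.most_common(1)[0]
        if count / max_judge_runs >= consensus:
            return winner

    # No clear consensus: the caller reframes the task and kicks off another attempt.
    return None
```

The consensus rule here is just a fraction of the judging budget; how you actually define consensus is exactly the kind of knob the questions below are about.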
> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.
If there are several agents doing analysis of solutions, how do you define a consensus? Should it be unanimous or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if the scores are soft? There is a whole lot of science in voting approaches; which voting approach is best here?
Is it possible for the analyzing agents to choose the best of several wrong solutions? E.g., the longest memorized table of FizzBuzz answers amongst memorized tables of FizzBuzz answers.
We have a voting algorithm that we use, but proceeding further in this discussion would get into confidential disclosure territory. There's lots of research out there into unbiased voting algorithms for consensus systems.
You conveniently decided not to answer my question about the quality of the solutions being voted on (ranking FizzBuzz memorization).
To me, our discussion shows that what you presented as a simple thing is not simple at all: even the voting is complex, and actually getting a good result is so hard it warrants omitting an answer altogether.
I had no expectations at all; I just asked questions, expecting answers. At the very beginning, the tone of your comment, as I read it, was "agentic coding is nothing but simple, look, they vote." Now the answers to simple but important questions are "confidential IP."
Okay then: agentic coding is nothing but a complex task requiring knowledge of unbiased voting (what is that thing, really?) and, apparently, a necessarily heavy test suite and/or theorem provers.
Run a code review agent, and ask it to identify issues. For each issue, run multiple independent agents to perform independent verification of this issue. There will always be some that concur and some that disagree. But the probability distributions are vastly different for real issues vs hallucinations. If it is a real issue they are more likely to happen upon it. If it is a hallucination, they are more likely to discover the inconsistency on fresh examination.
This is NOT the same as asking “are you sure?” The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough for teasing out signal from noise with agent orchestration.
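A minimal sketch of that verification step, with a hypothetical check_issue callable standing in for a fresh reviewer agent (not a real API), and a simple concurrence threshold as one possible decision rule:

```python
from typing import Callable

def issue_is_probably_real(
    check_issue: Callable[[str], bool],  # hypothetical: fresh-context agent with detached framing says whether the issue holds up
    issue: str,
    n_verifiers: int = 5,
    threshold: float = 0.6,              # fraction of verifiers that must concur
) -> bool:
    """Keep a reported issue only if enough independent fresh agents concur."""
    concur = sum(check_issue(issue) for _ in range(n_verifiers))
    return concur / n_verifiers >= threshold
```

The point being made above is that real issues should clear a threshold like this far more often than hallucinated ones.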
This hammer/screwdriver analogy drives me crazy. Yes, it's a tool, but up until now we used computers to give us correct, deterministic responses. Now the religion is that you need to get used to vibe answers, because it's the future :)
Of course it knows the script or formula for something: it ripped off the answers written by other people. It's a great search engine.
There was a different lamp startup article kind of recently, where they talked about this, and if I remember correctly they needed to run the lamp for like 1000 hours straight for it to receive some kind of certification.
I could search for it if you want to read about that.
I'm progressing with my side projects like never before.