faeyanpiraat's comments

For me, this is entirely true.

I'm progressing with my side projects like never before.


Same, I would have given up on them long ago. I no longer code at all now; why would I, when the latest models can do it better, faster, and without the human limitations of tiredness, emotional impacts, etc.?

Or the other way around, for fewer safety guardrails?

There must be a ranking of languages by "safety".


Heh, just wait till LLMs fully self-train and make up their own language to avoid human safety restraints.

What's z4?


What are you using those for?

Performance and entertainment at events, along with four of their Go2 Pro dogs (the real crowd-pleasers).

Why is hydraulics junk?

Perhaps because of the potentially slower actuation speed, but you also generally get a lot more power from hydraulics, so I'm not sure one can claim it is junk. Far less acrobatic, but also far more sumo wrestler.

Less fit for impressive YouTube videos. I'm sure they have other, more boring uses for the technology.

I wash new clothes by themselves the first time for this reason; it's a good enough solution.

I once had jeans with buttons, and at first I thought it was terrible, but after 2-3 wears I got used to it, so it's not a big deal, I guess.

And about quality: there's a YouTube channel where they cut shoes in half and rate the build quality, and based on the couple of shoes I've checked out, price and brand rarely correlate with good quality.

As with everything else, when buying expensive long-term items (like a leather boot), it is worth doing some research into which option is the best.


The other side of catching it going off the rails is when it wants to make edits without reading the context I know would've been necessary for a high-quality change.

Yeah, exactly. It's that confidence without understanding. Like, it'll make a change that looks reasonable in isolation but breaks an assumption that's only documented three files over, or relies on state that's set up elsewhere. And you can't always tell just from looking at the diff whether it actually understood the full picture or just got lucky. This is why seeing what files it's reading before it makes changes would be super helpful; at least you'd know if it missed something obvious.

Looking at it from afar, it's simply making something large from a smaller input, so it's kind of like nondeterministic decompression.

What fills the holes is best practices; what can ruin the result is wrong assumptions.

I don't see how full autonomy can work either without checkpoints along the way.


Totally agreed. Those assumptions often compound as well. So the AI makes one wrong decision early in the process, and it affects N downstream assumptions. When they finally finish their process, they've built the wrong thing. This happens with one process running. Even on the latest Opus models I have to babysit, correct, and redirect Claude Code constantly. There's zero chance that five Claude Code instances running for hours without my input are going to build the thing I actually need.

And at the end of the day, it's not the agents who are accountable for the code running in production. It's the human engineers.


Actually it works the other way. Multiple agents can often correct each other's mistaken assumptions. Part of the value of this approach is precisely that you do get better results with fewer hallucinated assumptions.

Still makes this change from Anthropic stupid.


The corrective agent has the exact same percentage chance of making a mistake: "correcting" an assumption that was previously correct into an incorrect one.

If a single agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.


You are assuming statistical independence, which is explicitly not correct here. There is also an error in your analysis: what matters is whether they make the same wrong assumption. That is far less likely, and becomes exponentially unlikely with increasing trials.
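To put rough numbers on it (a minimal sketch, assuming for simplicity that each agent errs independently with rate p and can land on any of k equally plausible wrong assumptions; both numbers are invented for illustration, and real agents are correlated):

  # Illustrative only: p and k are assumed, not measured.
  p = 0.01   # chance a single agent makes a wrong assumption
  k = 5      # distinct wrong assumptions it could plausibly land on
  for n in (1, 2, 3, 5, 10):
      any_err = 1 - (1 - p) ** n       # at least one of n agents errs somewhere
      same_wrong = k * (p / k) ** n    # all n agents make the SAME wrong assumption
      print(f"n={n:2d}  any_err={any_err:.4f}  same_wrong={same_wrong:.2e}")

The first column creeps up with n; the second collapses, which is the whole point of cross-checking.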

I can attest that it works well in practice, and my organization is already deploying this technique internally.


How do several wrong assumptions make it right with increasing trials?

You can ask Opus 4.6 to do a task and leave it running for 30 minutes or more to attempt one-shotting it. Imagine doing this with three agents in parallel in three separate worktrees. Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one. If there is no consensus after N runs, reframe to provide directions for a 4th attempt. Continue until a clear winning approach is found.
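A rough sketch of what that loop can look like (run_attempt, judge, and reframe are hypothetical callables you would wire up to your agent of choice; this is not any vendor's real API):

  # Hypothetical sketch of the parallel-attempts-plus-judge workflow described above.
  # run_attempt, judge, and reframe are placeholders supplied by you, not a real SDK.
  from collections import Counter

  def orchestrate(task, run_attempt, judge, reframe, n_attempts=3, max_rounds=5):
      # independent one-shot attempts, e.g. one git worktree each
      attempts = [run_attempt(task, worktree=i) for i in range(n_attempts)]
      votes = []
      for _ in range(max_rounds):
          # judge returns the index of the attempt it prefers, fresh context each pass
          votes.append(judge(task, attempts))
          winner, count = Counter(votes).most_common(1)[0]
          if len(votes) >= 3 and count > len(votes) / 2:   # clear consensus
              return attempts[winner]
      # no consensus after max_rounds: reframe with what was learned and retry
      return orchestrate(reframe(task, attempts), run_attempt, judge, reframe)

The consensus rule here (simple majority over at least three judge passes) is just one choice among many.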

This is one example of an orchestration workflow. There are others.


  > Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.

If there are several agents doing the analysis of solutions, how do you define a consensus? Should it be unanimous, or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if the scores are soft? There is a whole lot of science on voting approaches; which voting approach is best here?

Is it possible for the analyzing agents to choose the best of several wrong solutions? E.g., the longest memorized table of FizzBuzz answers among memorized tables of FizzBuzz answers.


We have a voting algorithm that we use, but we would be getting into confidential disclosure if we proceeded further in this discussion. There's lots of research out there into unbiased voting algorithms for consensus systems.

You conveniently decided not to answer my question about the quality of the solutions being voted on (ranking FizzBuzz memorization).

To me, our discussion shows that what you presented as a simple thing is not simple at all: even the voting is complex, and actually getting a good result is so hard that it warrants omitting an answer altogether.


Yeah, you've got unrealistic expectations if you expect me to divulge my company's confidential IP in an HN comment.

I had no expectations at all; I just asked questions, expecting answers. At the very beginning, the tone of your comment, as I read it, was "agentic coding is nothing but simple, look, they vote." Now the answers to simple but important questions are "confidential IP."

Okay then, agentic coding is nothing but a complex task requiring knowledge of unbiased voting (what is this thing, really?) and, apparently, the use of a necessarily heavy test suite and/or theorem provers.


It was like a scene from a sci-fi movie (I mean the Claude demo to CTOs).

Nonsense. If you have 16 binary decisions, that's 64k possible paths.

These are not independent samplings.

Indeed. Doesn't that make it worse? Prior decisions will bring up path-dependent options, ensuring they aren't even close to the same path.

Run a code-review agent and ask it to identify issues. For each issue, run multiple independent agents to verify it. There will always be some that concur and some that disagree, but the probability distributions are vastly different for real issues vs. hallucinations. If it is a real issue, they are more likely to happen upon it; if it is a hallucination, they are more likely to discover the inconsistency on fresh examination.

This is NOT the same as asking "are you sure?" The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough to tease out signal from noise with agent orchestration.
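A minimal sketch of that verification pass (verify is a hypothetical callable that spawns a fresh agent with neutral framing and returns True/False; the verifier count and the 0.6 threshold are purely illustrative):

  # Sketch only: verify() and the numbers are assumptions, not a real tool.
  def triage(issues, verify, n_verifiers=5, threshold=0.6):
      confirmed = []
      for issue in issues:
          votes = [verify(issue) for _ in range(n_verifiers)]   # fresh context each run
          if sum(votes) / n_verifiers >= threshold:             # most verifiers concur
              confirmed.append(issue)                           # likely a real issue
      return confirmed   # everything else is treated as a probable hallucination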


Take a look at the latest Codex on very-high. Claude’s astroturfed IMHO.

Can you explain more? I'm talking about LLM/agent behavior in a generalized sense, even though I used Claude Code as the example here.

What is Codex doing differently to solve this problem?


Yeah, but now you know that if you need to do math, you ask the AI for a Python script to do the math correctly.
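For example (purely illustrative numbers, not anything a model actually produced), instead of trusting its in-context arithmetic you ask it for something like:

  # Let the interpreter do the arithmetic instead of the model.
  principal, rate, years = 10_000, 0.043, 17
  print(round(principal * (1 + rate) ** years, 2))   # compound growth, computed exactly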

It's just a tool that you get better at using over time; a hammer wouldn't complain if you tried using it as a screwdriver.


This hammer/screwdriver analogy drives me crazy. Yes, it's a tool, but until now we used computers to give us correct, deterministic responses. Now the religion is that you need to get used to vibe answers, because it's the future :) Of course it knows the script or formula for something, because it ripped off the answers written by other people; it's a great search engine.

There was a different lamp startup article kind of recently, where they talked about this, and if I remember correctly they needed to run the lamp for like 1000 hours straight for it to receive some kind of certification.

I could search for it if you want to read about that.

