Same, I would have given up on them long ago, and I no longer code at all now. Why would I, when the latest models can do it better, faster, and without human limitations like tiredness and emotional swings?
Perhaps because of the potentially slower actuation speed, but you also generally get a lot more power from hydraulics, so I'm not sure one can claim it is junk. Far less acrobatic, but also far more sumo wrestler.
I wash new clothes by themselves the first time for this reason; it's a good enough solution.
I once had jeans with buttons, and at first I thought it was terrible, but after 2-3 uses I got used to it, so it's not a big deal I guess.
And on quality: there is a YouTube channel where they cut shoes in half and rate the quality of the build, and based on the couple of shoes I've checked out, price and brand rarely correlate with good quality.
As with everything else, when buying expensive long-term items (like leather boots), it is worth doing some research into which option is best.
The other side of catching it going off the rails is when it wants to make edits without reading the context I know would've been necessary for a high-quality change.
yeah exactly - it's that confidence without understanding. like it'll make a change that looks reasonable in isolation but breaks an assumption that's only documented three files over, or relies on state that's set up elsewhere. and you can't always tell just from looking at the diff whether it actually understood the full picture or just got lucky. this is why seeing what files it's reading before it makes changes would be super helpful - at least you'd know if it missed something obvious
Totally agreed. Those assumptions often compound as well. So the AI makes one wrong decision early in the process and it affects N downstream assumptions. When it finally finishes, it has built the wrong thing. And this happens with just one process running. Even on the latest Opus models I have to babysit, correct, and redirect Claude Code constantly. There's zero chance that 5 Claude Codes running for hours without my input are going to build the thing I actually need.
And at the end of the day it's not the agents who are accountable for the code running in production. It's the human engineers.
Actually, it works the other way. With multiple agents they can often correct each other's mistaken assumptions. Part of the value of this approach is precisely that you do get better results with fewer hallucinated assumptions.
The corrective agent has the exact same percentage chance of making a mistake: "correcting" an assumption that was previously correct into an incorrect one.
If a single agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.
You are assuming statistical independence, which is explicitly not correct here. There is also an error in your analysis - what matters is whether they make the same wrong assumption. That is far less likely, and becomes exponentially unlikely with increasing trials.
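To put rough numbers on that last point, here is a back-of-envelope sketch with made-up figures, treating the agents as roughly independent (which real agents only approximate):

```python
# Hypothetical numbers: each agent independently has a 1% chance of landing
# on one particular wrong assumption. The question is how often several
# agents all land on the *same* wrong assumption.

p_wrong = 0.01

for n_agents in (1, 2, 3, 5, 10):
    # Probability that every one of the n agents makes that same wrong assumption.
    p_all_same_wrong = p_wrong ** n_agents
    print(f"{n_agents:2d} agents: P(all share the same wrong assumption) = {p_all_same_wrong:.0e}")
```

The chance that any one of them is wrong stays roughly constant, but the chance that they all agree on the same wrong thing falls off exponentially, which is what makes cross-checking useful.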
I can attest that it works well in practice, and my organization is already deploying this technique internally.
You can ask Opus 4.6 to do a task and leave it running for 30 minutes or more to attempt one-shotting it. Imagine doing this with three agents in parallel in three separate worktrees. Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one. If no consensus after N runs, reframe to provide directions for a 4th attempt. Continue until a clear winning approach is found.
This is one example of an orchestration workflow. There are others.
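For what it's worth, here is a rough sketch of that loop in Python, with hypothetical run_attempt and judge callables standing in for whatever agent CLI or API you actually orchestrate with (neither is a real Claude Code interface), and a simple consensus fraction as one possible stopping rule:

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def pick_winning_approach(
    run_attempt: Callable[[int], str],      # hypothetical: run one agent in its own worktree, return an approach/branch id
    judge: Callable[[Sequence[str]], str],  # hypothetical: fresh-context agent that returns the id it prefers
    n_attempts: int = 3,
    max_judge_runs: int = 5,
    consensus: float = 0.6,                 # fraction of the judging budget that must agree
) -> Optional[str]:
    """Sketch of 'parallel attempts, then repeated fresh-context judging'."""
    approaches = [run_attempt(i) for i in range(n_attempts)]

    votes: Counter[str] = Counter()
    for _ in range(max_judge_runs):
        votes[judge(approaches)] += 1
        winner, count = votes.most_common(1)[0]
        if count / max_judge_runs >= consensus:
            return winner

    # No clear consensus: the caller reframes the task and kicks off another attempt.
    return None
```

The consensus rule here is just a fraction of the judging budget; how you actually define consensus is exactly the kind of knob the questions below are about.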
> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.
If there are several agents doing analysis of solutions, how do you define a consensus? Should it be unanimous or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if the scores are soft? There is a whole lot of science in voting approaches; which voting approach is best here?
Is it possible for the analyzing agents to choose the best of several wrong solutions? E.g., the longest memorized table of FizzBuzz answers amongst memorized tables of FizzBuzz answers.
We have a voting algorithm that we use, but proceeding further in this discussion would get into confidential disclosure territory. There's lots of research out there into unbiased voting algorithms for consensus systems.
You conveniently decided not to answer my question about the quality of the solutions being voted on (ranking FizzBuzz memorization).
To me, our discussion shows that what you presented as a simple thing is not simple at all: even the voting is complex, and actually getting a good result is so hard it warrants omitting an answer altogether.
I had no expectations at all; I just asked questions, expecting answers. At the very beginning, the tone of your comment, as I read it, was "agentic coding is nothing but simple, look, they vote." Now the answers to simple but important questions are "confidential IP."
Okay then: agentic coding is nothing but a complex task requiring knowledge of unbiased voting (what is that thing, really?) and, apparently, a necessarily heavy test suite and/or theorem provers.
Run a code review agent, and ask it to identify issues. For each issue, run multiple independent agents to perform independent verification of this issue. There will always be some that concur and some that disagree. But the probability distributions are vastly different for real issues vs hallucinations. If it is a real issue they are more likely to happen upon it. If it is a hallucination, they are more likely to discover the inconsistency on fresh examination.
This is NOT the same as asking “are you sure?” The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough for teasing out signal from noise with agent orchestration.
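A minimal sketch of that verification step, with a hypothetical check_issue callable standing in for a fresh reviewer agent (not a real API), and a simple concurrence threshold as one possible decision rule:

```python
from typing import Callable

def issue_is_probably_real(
    check_issue: Callable[[str], bool],  # hypothetical: fresh-context agent with detached framing says whether the issue holds up
    issue: str,
    n_verifiers: int = 5,
    threshold: float = 0.6,              # fraction of verifiers that must concur
) -> bool:
    """Keep a reported issue only if enough independent fresh agents concur."""
    concur = sum(check_issue(issue) for _ in range(n_verifiers))
    return concur / n_verifiers >= threshold
```

The point being made above is that real issues should clear a threshold like this far more often than hallucinated ones.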
This hammer/screwdriver analogy drives me crazy. Yes, it's a tool, but up until now we used computers to give us correct, deterministic responses. Now the religion is that you need to get used to vibe answers, because it's the future :)
Of course it knows the script or formula for something: it ripped off the answers written by other people. It's a great search engine.
There was a different lamp startup article kind of recently, where they talked about this, and if I remember correctly they needed to run the lamp for like 1000 hours straight for it to receive some kind of certification.
I could search for it if you want to read about that.
I'm progressing with my side projects like never before.