everybody loves building agents, nobody likes debugging them. agents hit the classic llm app lifecycle problem: at first it feels magical. it nails the first few tasks, doing things you didn’t even think were possible. you get excited, start pushing it further. then you run it and it fails on step 17, then step 41, then step 9.
now you can’t reproduce it because it’s probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.
That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
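A minimal sketch of what that tooling can look like, assuming each past run is saved as a scenario you can replay (the harness, scenario names, and pass criterion here are all made up):

```python
import asyncio
import random

# Hypothetical harness: after a change, replay every recorded scenario many
# times in parallel and report per-scenario pass rates so regressions and
# flaky steps stand out.

async def run_scenario(scenario: dict) -> bool:
    """Replay one recorded agent run; return True if it still passes.

    Stubbed with a coin flip here; a real harness would call the agent with
    scenario["input"] and check the result against scenario["expected"].
    """
    await asyncio.sleep(0)  # stand-in for the actual agent call
    return random.random() < 0.95

async def regression_sweep(scenarios: list[dict], repeats: int = 100) -> dict[str, float]:
    # Launch every (scenario, repeat) pair concurrently; gather preserves order.
    results = await asyncio.gather(
        *(run_scenario(s) for s in scenarios for _ in range(repeats))
    )
    rates = {}
    for idx, s in enumerate(scenarios):
        chunk = results[idx * repeats:(idx + 1) * repeats]
        rates[s["name"]] = sum(chunk) / repeats
    return rates

if __name__ == "__main__":
    past_runs = [{"name": "invoice-workflow"}, {"name": "email-triage"}]
    print(asyncio.run(regression_sweep(past_runs, repeats=50)))
```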
There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
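In practice that tends to mean statistical acceptance tests rather than proofs: run each known case many times, assert a minimum pass rate, and keep adding cases as new failures appear. A rough pytest-style sketch (the case names and thresholds are invented):

```python
import pytest

# Invented examples: each entry is a previously observed failure mode with a
# minimum acceptable pass rate. The set only grows over time.
KNOWN_CASES = [
    {"name": "step-17-tool-call", "min_pass_rate": 0.95},
    {"name": "step-41-retry", "min_pass_rate": 0.90},
]

TRIALS = 50

def run_agent_case(name: str) -> bool:
    """Hypothetical: replay the recorded case against the agent."""
    return True  # replace with a real replay of the recorded scenario

@pytest.mark.parametrize("case", KNOWN_CASES, ids=lambda c: c["name"])
def test_known_cases_pass_rate(case):
    passes = sum(run_agent_case(case["name"]) for _ in range(TRIALS))
    assert passes / TRIALS >= case["min_pass_rate"]
```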
This is a use of Rerun that I haven't seen before!
This is pretty fascinating!!!
Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's fascinating here is that Adam, for his master's thesis, is using Rerun to visualize agent state (as in software / LLM agent state).
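If you're curious what that can look like, here's a rough sketch of logging agent state with Rerun's Python SDK. This is my guess at the pattern, not Adam's actual setup; the trace and entity paths are made up, and exact API names vary a bit between Rerun versions.

```python
import rerun as rr

# Made-up trace of an agent run; in practice this would come from your agent
# framework's callbacks or saved logs.
agent_events = [
    {"thought": "need the invoice total", "tool": "read_pdf", "args": "inv_07.pdf", "tokens": 812},
    {"thought": "total looks off, retry with OCR", "tool": "read_pdf", "args": "inv_07.pdf", "tokens": 954},
]

rr.init("agent_debugger", spawn=True)  # spawn the Rerun viewer

for step, event in enumerate(agent_events):
    # Put each log call on a per-step timeline so the run can be scrubbed
    # back and forth in the viewer instead of re-run.
    rr.set_time_sequence("step", step)  # newer SDKs: rr.set_time("step", sequence=step)
    rr.log("agent/thought", rr.TextLog(event["thought"]))
    rr.log("agent/tool_call", rr.TextLog(f"{event['tool']}({event['args']})"))
    rr.log("agent/tokens", rr.Scalar(event["tokens"]))  # rr.Scalars on newer SDKs
```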
For sure - for instance, Google has the ADK Eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
heya, building this. been used in prod for a month now, has saved my customer’s ass while building general workflow automation agents. happy to chat if you're interested.
That everybody seems to love building these things while people like you harbor deep skepticism about them is a reason to get your hands dirty with an agent. The cost of doing so is 30-45 minutes of your time, and it will arm you with an understanding you can use to make better arguments against them.
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
We can definitely be clearer about our focus on programming-related queries. We usually don't display code snippets for non-programming questions, but we're definitely still tuning a couple of things there.
We're not focused on simple factoid answers like the population of a city, because that's not where people get the most value.
The AWS API is a bit tricky because it's a rather broad technology with SDKs in different languages, so the search results for a question return a mishmash of solutions which we then try to make sense of. If you share some sample queries you tried related to this, I'd be happy to look into them and improve our answers there.
Business model: We're currently just focused on building something developers want. Agree that ads and dev tools aren't the most synergetic.
We think the solution is simply having good sources and answer transparency. If you mouse over part of the answer, we try to show you the source of that sentence. Obviously this system is early and will improve over time, but if you can easily check whether an answer is from, say, the official FastAPI documentation, then the false-confidence effect of these models becomes less of an issue.
Interesting - we're trying to wrangle the nondeterminism, but sometimes it can't be helped, as Bing itself can produce different results. We're always actively working on the model, though.
The source-citing feature can definitely be improved - right now it works at sentence granularity and insists on finding the best source even when that isn't appropriate. Thanks for pointing all of this out.
You can leave feedback on individual search results via the check and X buttons below the answer. Leaving feedback on each code snippet, e.g. votes, is on the roadmap.
Ah, we only put the option for a detailed comment on negative feedback (try clicking the X and a form will pop up). We'll also give that option for positive feedback in the future.
Thanks for trying it out - we're still quite early, so the model isn't going to be perfect, and we're focused on programming-related queries at the moment. What is your use case? Are most of your searches related to history?
Most of my searches are related to researching and fact-checking potentially spurious statements, like the ones that Hello apparently produces.
This particular query about the French revolution seems to give GPT fits across its iterations, and I suspect it's because:
- the most correct answer is "I don't know"; when answering questions, GPT isn't really trying to answer a question by reasoning through it - it's trying to mimic what usually happens after someone asks a question, and especially online that usually isn't someone saying "I don't know"
- the most authoritative sources on subjects like this include books, and GPT doesn't seem to read a lot of those, or if it has then it doesn't seem to be able to cite them without inventing false details about the sources themselves (titles that don't exist, attributing works to the wrong authors)
Seeing "model that cites sources" made me excited that there was a solution to the second point. But specific to Hello:
- the most authoritative web sources are historians who cite their sources, and those often aren't written for SEO, hosted on popular platforms, or free of paywalls
- a search result being in a first page of Bing/Google doesn't make it authoritative
- this doesn't seem to stop GPT from coming up with false inventions/hallucinations that aren't in or relevant to the cited source at all
You're right - there are so many good sources, like books and experts, that aren't often surfaced by a cursory web search. Adding better sources is something we're improving on. A simple next step we're looking into is expanding sources to research papers, e.g. explicitly pulling from arXiv on some queries.