everybody loves building agents, nobody likes debugging them. agents hit the classic llm app lifecycle problem: at first it feels magical. it nails the first few tasks, doing things you didn’t even think were possible. you get excited, start pushing it further. then you run it and it fails on step 17, then step 41, then step 9.
now you can’t reproduce it because it’s probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.
That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
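A minimal sketch of what that tooling can look like, assuming each past run is saved as a scenario you can replay (the harness, scenario names, and pass criterion here are all made up):

```python
import asyncio
import random

# Hypothetical harness: after a change, replay every recorded scenario many
# times in parallel and report per-scenario pass rates so regressions and
# flaky steps stand out.

async def run_scenario(scenario: dict) -> bool:
    """Replay one recorded agent run; return True if it still passes.

    Stubbed with a coin flip here; a real harness would call the agent with
    scenario["input"] and check the result against scenario["expected"].
    """
    await asyncio.sleep(0)  # stand-in for the actual agent call
    return random.random() < 0.95

async def regression_sweep(scenarios: list[dict], repeats: int = 100) -> dict[str, float]:
    # Launch every (scenario, repeat) pair concurrently; gather preserves order.
    results = await asyncio.gather(
        *(run_scenario(s) for s in scenarios for _ in range(repeats))
    )
    rates = {}
    for idx, s in enumerate(scenarios):
        chunk = results[idx * repeats:(idx + 1) * repeats]
        rates[s["name"]] = sum(chunk) / repeats
    return rates

if __name__ == "__main__":
    past_runs = [{"name": "invoice-workflow"}, {"name": "email-triage"}]
    print(asyncio.run(regression_sweep(past_runs, repeats=50)))
```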
There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
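In practice that tends to mean statistical acceptance tests rather than proofs: run each known case many times, assert a minimum pass rate, and keep adding cases as new failures appear. A rough pytest-style sketch (the case names and thresholds are invented):

```python
import pytest

# Invented examples: each entry is a previously observed failure mode with a
# minimum acceptable pass rate. The set only grows over time.
KNOWN_CASES = [
    {"name": "step-17-tool-call", "min_pass_rate": 0.95},
    {"name": "step-41-retry", "min_pass_rate": 0.90},
]

TRIALS = 50

def run_agent_case(name: str) -> bool:
    """Hypothetical: replay the recorded case against the agent."""
    return True  # replace with a real replay of the recorded scenario

@pytest.mark.parametrize("case", KNOWN_CASES, ids=lambda c: c["name"])
def test_known_cases_pass_rate(case):
    passes = sum(run_agent_case(case["name"]) for _ in range(TRIALS))
    assert passes / TRIALS >= case["min_pass_rate"]
```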
This is a use of Rerun that I haven't seen before!
This is pretty fascinating!!!
Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's fascinating here is that Adam, for his master's thesis, is using Rerun to visualize agent state (as in software / LLM agent state).
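If you're curious what that can look like, here's a rough sketch of logging agent state with Rerun's Python SDK. This is my guess at the pattern, not Adam's actual setup; the trace and entity paths are made up, and exact API names vary a bit between Rerun versions.

```python
import rerun as rr

# Made-up trace of an agent run; in practice this would come from your agent
# framework's callbacks or saved logs.
agent_events = [
    {"thought": "need the invoice total", "tool": "read_pdf", "args": "inv_07.pdf", "tokens": 812},
    {"thought": "total looks off, retry with OCR", "tool": "read_pdf", "args": "inv_07.pdf", "tokens": 954},
]

rr.init("agent_debugger", spawn=True)  # spawn the Rerun viewer

for step, event in enumerate(agent_events):
    # Put each log call on a per-step timeline so the run can be scrubbed
    # back and forth in the viewer instead of re-run.
    rr.set_time_sequence("step", step)  # newer SDKs: rr.set_time("step", sequence=step)
    rr.log("agent/thought", rr.TextLog(event["thought"]))
    rr.log("agent/tool_call", rr.TextLog(f"{event['tool']}({event['args']})"))
    rr.log("agent/tokens", rr.Scalar(event["tokens"]))  # rr.Scalars on newer SDKs
```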
For sure - for instance, Google has the ADK Eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
heya, building this. been used in prod for a month now, has saved my customer’s ass while building general workflow automation agents. happy to chat if you're interested.
That everybody seems to love building these things while people like you harbor deep skepticism about them is a reason to get your hands dirty with an agent. The cost of doing so is 30-45 minutes of your time, and it will arm you with an understanding you can use to make better arguments against them.
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
We can definitely be clearer about our focus on programming-related queries. We usually don't display code snippets for non-programming questions, but we're definitely still tuning a couple of things there.
We're not focused on simple factoid answers like the population of a city, because that's not where people get the most value.
The AWS API is a bit tricky because it's a rather broad technology with SDKs in different languages, so the search results for a question return a mishmash of solutions which we then try to make sense of. If you share some sample queries you tried related to this, I'd be happy to look into them and improve our answers there.
Business model: We're currently just focused on building something developers want. Agree that ads and dev tools aren't the most synergetic.
We think the solution is simply having good sources and answer transparency. If you mouse over part of the answer, we try to show you the source of that sentence. Obviously this system is early and will improve over time, but if you can easily check whether an answer is from, say, the official FastAPI documentation, then the false-confidence effect of these models becomes less of an issue.
Interesting - we're trying to wrangle the nondeterminism, but sometimes it can't be helped, as Bing itself can produce different results. We're always actively working on the model, though.
The source-citing feature can definitely be improved - right now it works at sentence granularity and insists on finding the best source even when that isn't appropriate. Thanks for pointing all of this out.
You can leave feedback on individual search results via the check and X buttons below the answer. Leaving feedback on each code snippet, e.g. votes, is on the roadmap.
Ah, we only put the option for a detailed comment on negative feedback (try clicking the X and a form will pop up). We'll also give that option for positive feedback in the future.
Thanks for trying it out - we're still quite early, so the model isn't going to be perfect, and we're focused on programming-related queries at the moment. What is your use case? Are most of your searches related to history?
Most of my searches are related to researching and fact-checking potentially spurious statements, like the ones that Hello apparently produces.
This particular query about the French revolution seems to give GPT fits across its iterations, and I suspect it's because:
- the most correct answer is "I don't know"; when answering questions, GPT isn't really trying to answer a question by reasoning through it - it's trying to mimic what usually happens after someone asks a question, and especially online that usually isn't someone saying "I don't know"
- the most authoritative sources on subjects like this include books, and GPT doesn't seem to read a lot of those, or if it has then it doesn't seem to be able to cite them without inventing false details about the sources themselves (titles that don't exist, attributing works to the wrong authors)
Seeing "model that cites sources" made me excited that there was a solution to the second point. But specific to Hello:
- the most authoritative web sources are historians who cite their sources, and those often aren't written for SEO, hosted on popular platforms, or free of paywalls
- a search result being in a first page of Bing/Google doesn't make it authoritative
- this doesn't seem to stop GPT from coming up with false inventions/hallucinations that aren't in or relevant to the cited source at all
You're right - there are so many good sources, like books and experts, that aren't often surfaced by a cursory web search. Adding better sources is something we're improving on. A simple next step we're looking into is expanding sources to research papers, e.g. explicitly pulling from arXiv on some queries.