_jonas's comments | Hacker News

Evals are critical, and I love the practicality of this guide!

One problem not covered here is: knowing which data to review.

If your AI system produces, say, 95% accurate responses, your Evals team will waste most of its time reviewing production logs that turn out to be fine before it uncovers the different AI failure modes.

To let your Evals team spend its time only on the high-signal responses that are likely incorrect, I built a tool that automatically surfaces the least trustworthy LLM responses:

https://help.cleanlab.ai/tlm/

Hope you find it useful! I made sure it works out of the box, with zero configuration required.
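For illustration, here's a minimal sketch of the review workflow I mean. This is not the TLM API itself; `score_trustworthiness` is a hypothetical stand-in for whatever scorer you plug in:

```python
# Minimal sketch (not the TLM API): rank logged responses so reviewers
# see the least trustworthy ones first. `score_trustworthiness` is a
# hypothetical stand-in for whatever scorer you use.
from typing import Callable

def prioritize_for_review(
    logs: list[dict],                      # each log: {"prompt": ..., "response": ...}
    score_trustworthiness: Callable[[str, str], float],
    budget: int = 50,                      # how many logs the team can review
) -> list[dict]:
    """Return the `budget` least-trustworthy logged responses."""
    scored = [
        {**log, "trust_score": score_trustworthiness(log["prompt"], log["response"])}
        for log in logs
    ]
    scored.sort(key=lambda x: x["trust_score"])   # lowest trust first
    return scored[:budget]
```

The point is simply that reviewers work through a short, sorted queue instead of sampling logs at random.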


Hamel here. Thanks so much for asking this question! I will work on adding it to the FAQ. Please keep these coming!


This is why I built a startup for automated real-time trustworthiness scoring of LLM responses: https://help.cleanlab.ai/tlm/

Tools to mitigate unchecked hallucination are critical for high-stakes AI applications across finance, insurance, medicine, and law. At many enterprises I work with, even straightforward AI for customer support is too unreliable without a trust layer for detecting and remediating hallucinations.


Who is watching the watchers?

How do we know the TLM is any more accurate than the LLM (especially if it's not trained on any local data)? If determining veracity were that simple, LLMs would just incorporate a fact-checking stage.


You might be thinking of LLM-as-a-judge, where one simply asks another LLM to fact-check the response. Indeed, that approach is very unreliable due to LLM hallucinations, the very problem we are trying to mitigate in the first place.

TLM is instead an uncertainty-estimation technique applied to LLMs, not just another LLM.
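To give a flavor of what "uncertainty estimation applied to LLMs" can mean, here is a minimal sketch of one common idea, self-consistency across repeated samples. This is not TLM's actual algorithm, and `ask_llm` is a hypothetical function that samples one response from your model at nonzero temperature:

```python
# Minimal sketch of one uncertainty-estimation idea (self-consistency),
# NOT TLM's actual algorithm. `ask_llm` is a hypothetical function that
# samples one response from your LLM at nonzero temperature.
from collections import Counter
from typing import Callable

def self_consistency_score(
    prompt: str,
    ask_llm: Callable[[str], str],
    n_samples: int = 8,
) -> tuple[str, float]:
    """Sample the LLM several times; the fraction of samples that agree with
    the most common answer serves as a crude confidence score in [0, 1]."""
    samples = [ask_llm(prompt).strip().lower() for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / n_samples
```

If the model gives the same answer across most samples, the score is high; if its answers scatter, the score drops, flagging the response for review.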


Exactly, that's why my startup recommends all LLM outputs should come with trustworthiness scores:

https://cleanlab.ai/tlm/


My startup is working on this fundamental problem.

You can try out our early product here: https://cleanlab.ai/tlm/

(free to try, we'd love to hear your feedback)


Tested the free chat. The chat bot gave a slightly incorrect answer, and the trustworthiness check gave it a score of 0.749 and said the answer was completely incorrect, which was not actually the case. Having two answers that are each somewhat wrong seems more confusing.


I see this fallacy often too.

My company provides hallucination detection software: https://cleanlab.ai/tlm/

But we somehow end up in sales meetings where the person who requested the meeting claims their AI does not hallucinate ...


Has anyone run any meaningful benchmarks of this vs. google vs. perplexity?


I would love to have it as well.


This one looks pretty good, haven't tried it yet though: https://github.com/QuivrHQ/quivr


It's fun to try to guess what semantic concepts might be captured within individual dimensions, or pairs of dimensions, of the embedding space.
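A quick sketch of the kind of poking around I mean, assuming you already have an embedding matrix and vocabulary loaded (the names here are hypothetical):

```python
# Quick sketch: for a chosen embedding dimension, list the words that
# score highest and lowest along it and eyeball what concept (if any)
# that dimension seems to capture. `embeddings` (n_words x dim) and
# `vocab` are assumed to be loaded already.
import numpy as np

def inspect_dimension(embeddings: np.ndarray, vocab: list[str], dim: int, k: int = 10):
    order = np.argsort(embeddings[:, dim])          # ascending along that dimension
    lowest = [vocab[i] for i in order[:k]]
    highest = [vocab[i] for i in order[-k:][::-1]]
    print(f"dimension {dim}: high -> {highest}")
    print(f"dimension {dim}: low  -> {lowest}")
```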


Curious to learn how much harder it is to red-team models that use a second line of defense: an explicit guardrails library that checks the LLM response in a second step, such as Nvidia's NeMo Guardrails package.
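For context, the pattern I mean is roughly the following. This is not NeMo Guardrails' actual API; `generate_response` and `violates_policy` are hypothetical stand-ins for the model call and the output-side check:

```python
# Rough sketch of a "second line of defense": generate a draft response,
# then run it through a separate checker before returning it. This is
# NOT NeMo Guardrails' API; `generate_response` and `violates_policy`
# are hypothetical stand-ins.
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def guarded_reply(
    user_message: str,
    generate_response: Callable[[str], str],
    violates_policy: Callable[[str], bool],
) -> str:
    draft = generate_response(user_message)
    # Second step: an explicit check on the *output*, independent of any
    # prompt-side defenses. A red-teamer now has to beat both stages.
    if violates_policy(draft):
        return REFUSAL
    return draft
```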


I'm excited for LLM applications that can set up, monitor/validate, and optimize data pipelines at scale. This seems possible soon, given that SQL and most data records aren't intended to be human-friendly.


Ah, the hubris of youth.


Either this is a joke comment, or you haven't seen https://ludic.mataroa.blog/blog/i-will-fucking-piledrive-you...


Once again. Because clearly this needs to be said loudly and repeatedly.

That is a _technical solution_ to a fundamentally _social_ problem.


And now we have two problems...


Ten problems. Regex is for wusses who fear danger. LLMs are where real men turn.


When LLMs can do the following, they might be able to fix data hell:

- Negotiate with different teams to figure out what a field means

- Be told that a field should be converted from one format to another, but oh wait, it's causing errors somewhere downstream because the instructions it was given were wrong

- Handle people coming to them with some issue about the code they maintain, and dig enough to realize the root cause is another team's code


- explain how they came up with “12” as the answer.

