Evals are critical, and I love the practicality of this guide!
One problem not covered here is knowing which data to review.
If your AI system produces, say, 95% accurate responses, your evals team will spend too much time combing through production logs just to discover the different AI failure modes.
To let your evals team spend time only on the high-signal responses that are likely incorrect, I built a startup around automated, real-time trustworthiness scoring of LLM responses; the tool automatically surfaces the least trustworthy ones: https://help.cleanlab.ai/tlm/
Hope you find it useful; it works out-of-the-box with zero configuration required.
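To make that triage workflow concrete, here's a minimal sketch in Python. The `score_trustworthiness()` function is a hypothetical placeholder for whatever scorer you plug in (TLM or anything else); see the linked docs for the actual client API.

```python
# Sketch: score logged (prompt, response) pairs and route only the least
# trustworthy ones to human reviewers. score_trustworthiness() is a dummy
# placeholder, not a real API.

def score_trustworthiness(prompt: str, response: str) -> float:
    """Return a score in [0, 1], where higher means more trustworthy."""
    return 0.5  # dummy value; swap in your real scorer here

def select_for_review(logs: list[dict], review_budget: int = 50) -> list[dict]:
    """Return the `review_budget` least trustworthy logged responses."""
    scored = [
        {**log, "trust": score_trustworthiness(log["prompt"], log["response"])}
        for log in logs
    ]
    return sorted(scored, key=lambda item: item["trust"])[:review_budget]
```

Reviewers then only look at the bottom of that ranking instead of sampling logs uniformly.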
Tools to mitigate unchecked hallucination are critical for high-stakes AI applications across finance, insurance, medicine, and law. At many enterprises I work with, even straightforward AI for customer support is too unreliable without a trust layer for detecting and remediating hallucinations.
How do we know the TLM is any more accurate than the LLM (especially if it's not trained on any local data)? If determining veracity were that simple, LLMs would just incorporate a fact-checking stage.
You might be thinking of LLM-as-a-judge, where one simply asks another LLM to fact-check the response. That is indeed quite unreliable, because the judge is subject to the same hallucinations we are trying to mitigate in the first place.
TLM is instead an uncertainty-estimation technique applied on top of LLMs, not just another LLM.
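For intuition, one common ingredient of this kind of uncertainty estimation (a generic sketch, not necessarily Cleanlab's exact method) is self-consistency: sample the model several times at nonzero temperature and measure how much the answers agree. `call_llm` below is a placeholder for whatever client you use.

```python
import collections
from typing import Callable

def self_consistency_score(
    prompt: str,
    call_llm: Callable[[str], str],  # your LLM client, sampling with temperature > 0
    n_samples: int = 5,
) -> float:
    """Crude confidence proxy: the fraction of sampled answers that match the
    most common answer. Low agreement suggests a less trustworthy response."""
    answers = [call_llm(prompt).strip().lower() for _ in range(n_samples)]
    top_count = collections.Counter(answers).most_common(1)[0][1]
    return top_count / n_samples
```

Exact string matching only works for short answers; real systems compare responses by semantic similarity and combine agreement with other signals (e.g., token probabilities or self-reflection prompts).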
Tested the free chat. The chatbot gave a slightly incorrect answer, and the trustworthiness feature gave it a score of 0.749 and said the answer was completely incorrect, which was not actually the case. Two outputs that are each somewhat wrong seems more confusing than one.
Curious how much harder it is to red-team models that add a second line of defense: an explicit guardrails library, such as Nvidia's NeMo Guardrails package, that checks the LLM response in a second step.
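For anyone unfamiliar with that pattern, here's a generic sketch of a second-step output check (a hypothetical checker, not NeMo Guardrails' actual API): generate a draft response, then validate it before it reaches the user.

```python
from typing import Callable

# Toy policy for illustration only; a real guardrails library would use
# configurable rules, classifiers, or its own LLM-based checks.
BLOCKED_PHRASES = ("wire the funds to", "increase the dosage to")

def passes_output_check(response: str) -> bool:
    """Second line of defense: reject responses containing blocked phrases."""
    return not any(phrase in response.lower() for phrase in BLOCKED_PHRASES)

def guarded_reply(prompt: str, call_llm: Callable[[str], str]) -> str:
    draft = call_llm(prompt)          # step 1: normal LLM generation
    if passes_output_check(draft):    # step 2: explicit guardrail on the output
        return draft
    return "Sorry, I can't help with that request."
```

Red-teaming then has to defeat both the model's own alignment and this separate check, which is presumably what makes it harder.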
I'm excited for LLM applications that can set up, monitor/validate, and optimize data pipelines at scale. This seems possible soon, given that SQL and most data records aren't intended to be human-friendly.
When LLMs can do the following, they might be able to fix data hell:
- Negotiate with different teams to figure out what a field means
- Be told that a field should be converted from one format to another, then discover the conversion is breaking something downstream because the instructions it was given were wrong
- Handle people coming to it with an issue in the code it maintains, and dig far enough to realize the root cause is another team's code
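As a toy illustration of the monitor/validate piece of that wish list, here is a hypothetical sketch of asking an LLM to sanity-check pipeline records against a plain-English field description; the names and the `call_llm` client are assumptions, not any real product's API.

```python
from typing import Callable

# Human-written description of what a field is supposed to contain.
FIELD_DESCRIPTION = "order_date must be an ISO-8601 date (YYYY-MM-DD), not in the future"

def record_looks_valid(record: dict, call_llm: Callable[[str], str]) -> bool:
    """Ask the LLM whether a record satisfies the field description."""
    prompt = (
        f"Field description: {FIELD_DESCRIPTION}\n"
        f"Record: {record}\n"
        "Does this record satisfy the description? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```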