Evals are critical, and I love the practicality of this guide!
One problem not covered here is knowing which data to review.
If your AI system produces, say, 95% accurate responses, your evals team will spend too much time combing through production logs just to discover the different AI failure modes.
To let your evals team spend time only on the high-signal responses that are likely incorrect, I built a startup around automated, real-time trustworthiness scoring of LLM responses; the tool automatically surfaces the least trustworthy ones: https://help.cleanlab.ai/tlm/
Hope you find it useful; it works out-of-the-box with zero configuration required.
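To make that triage workflow concrete, here's a minimal sketch in Python. The `score_trustworthiness()` function is a hypothetical placeholder for whatever scorer you plug in (TLM or anything else); see the linked docs for the actual client API.

```python
# Sketch: score logged (prompt, response) pairs and route only the least
# trustworthy ones to human reviewers. score_trustworthiness() is a dummy
# placeholder, not a real API.

def score_trustworthiness(prompt: str, response: str) -> float:
    """Return a score in [0, 1], where higher means more trustworthy."""
    return 0.5  # dummy value; swap in your real scorer here

def select_for_review(logs: list[dict], review_budget: int = 50) -> list[dict]:
    """Return the `review_budget` least trustworthy logged responses."""
    scored = [
        {**log, "trust": score_trustworthiness(log["prompt"], log["response"])}
        for log in logs
    ]
    return sorted(scored, key=lambda item: item["trust"])[:review_budget]
```

Reviewers then only look at the bottom of that ranking instead of sampling logs uniformly.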
Tools to mitigate unchecked hallucination are critical for high-stakes AI applications across finance, insurance, medicine, and law. At many enterprises I work with, even straightforward AI for customer support is too unreliable without a trust layer for detecting and remediating hallucinations.
How do we know the TLM is any more accurate than the LLM (especially if it's not trained on any local data)? If determining veracity were that simple, LLMs would just incorporate a fact-checking stage.
You might be thinking of LLM-as-a-judge, where one simply asks another LLM to fact-check the response. That is indeed quite unreliable, because the judge is subject to the same hallucinations we are trying to mitigate in the first place.
TLM is instead an uncertainty-estimation technique applied on top of LLMs, not just another LLM.
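For intuition, one common ingredient of this kind of uncertainty estimation (a generic sketch, not necessarily Cleanlab's exact method) is self-consistency: sample the model several times at nonzero temperature and measure how much the answers agree. `call_llm` below is a placeholder for whatever client you use.

```python
import collections
from typing import Callable

def self_consistency_score(
    prompt: str,
    call_llm: Callable[[str], str],  # your LLM client, sampling with temperature > 0
    n_samples: int = 5,
) -> float:
    """Crude confidence proxy: the fraction of sampled answers that match the
    most common answer. Low agreement suggests a less trustworthy response."""
    answers = [call_llm(prompt).strip().lower() for _ in range(n_samples)]
    top_count = collections.Counter(answers).most_common(1)[0][1]
    return top_count / n_samples
```

Exact string matching only works for short answers; real systems compare responses by semantic similarity and combine agreement with other signals (e.g., token probabilities or self-reflection prompts).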
Tested the free chat. The chatbot gave a slightly incorrect answer, and the trustworthiness feature gave it a score of 0.749 and said the answer was completely incorrect, which was not actually the case. Two outputs that are each somewhat wrong seems more confusing than one.
Curious how much harder it is to red-team models that add a second line of defense: an explicit guardrails library, such as Nvidia's NeMo Guardrails package, that checks the LLM response in a second step.
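For anyone unfamiliar with that pattern, here's a generic sketch of a second-step output check (a hypothetical checker, not NeMo Guardrails' actual API): generate a draft response, then validate it before it reaches the user.

```python
from typing import Callable

# Toy policy for illustration only; a real guardrails library would use
# configurable rules, classifiers, or its own LLM-based checks.
BLOCKED_PHRASES = ("wire the funds to", "increase the dosage to")

def passes_output_check(response: str) -> bool:
    """Second line of defense: reject responses containing blocked phrases."""
    return not any(phrase in response.lower() for phrase in BLOCKED_PHRASES)

def guarded_reply(prompt: str, call_llm: Callable[[str], str]) -> str:
    draft = call_llm(prompt)          # step 1: normal LLM generation
    if passes_output_check(draft):    # step 2: explicit guardrail on the output
        return draft
    return "Sorry, I can't help with that request."
```

Red-teaming then has to defeat both the model's own alignment and this separate check, which is presumably what makes it harder.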
I'm excited for LLM applications that can set up, monitor/validate, and optimize data pipelines at scale. This seems possible soon, given that SQL and most data records aren't intended to be human-friendly.
When LLMs can do the following, they might be able to fix data hell:
- Negotiate with different teams to figure out what a field means
- Be told that a field should be converted from one format to another, then discover the conversion is breaking something downstream because the instructions it was given were wrong
- Handle people coming to it with an issue in the code it maintains, and dig far enough to realize the root cause is another team's code
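As a toy illustration of the monitor/validate piece of that wish list, here is a hypothetical sketch of asking an LLM to sanity-check pipeline records against a plain-English field description; the names and the `call_llm` client are assumptions, not any real product's API.

```python
from typing import Callable

# Human-written description of what a field is supposed to contain.
FIELD_DESCRIPTION = "order_date must be an ISO-8601 date (YYYY-MM-DD), not in the future"

def record_looks_valid(record: dict, call_llm: Callable[[str], str]) -> bool:
    """Ask the LLM whether a record satisfies the field description."""
    prompt = (
        f"Field description: {FIELD_DESCRIPTION}\n"
        f"Record: {record}\n"
        "Does this record satisfy the description? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```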