Show HN: Open-source model and scorecard for measuring hallucinations in LLMs (vectara.com)
65 points by eskibars on Nov 6, 2023 | 14 comments
Hi all! This morning, we released a new Apache 2.0 licensed model on HuggingFace for detecting hallucinations in retrieval augmented generation (RAG) systems.

What we've found is that even when given a "simple" instruction like "summarize the following news article," every LLM that's available hallucinates to some extent, making up details that never existed in the source article -- and some of them quite a bit. As a RAG provider and proponents of ethical AI, we want to see LLMs get better at this. We've published an open source model, a blog post more thoroughly describing our methodology (with some specific examples of these summarization hallucinations), and a GitHub repository containing our evaluation of the most popular generative LLMs available today. Links to all of them are referenced in the blog post, but for the technical audience here, the most interesting additional links might be:

- https://huggingface.co/vectara/hallucination_evaluation_mode...

- https://github.com/vectara/hallucination-leaderboard

By releasing these under a truly open source license and detailing the methodology, we hope to make it viable for anyone to quantitatively measure and improve the generative LLMs they're publishing.
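If you want to poke at the model locally, here's a rough sketch of scoring a (source, summary) pair. The exact loading call and output interpretation are assumptions on my part; check the model card for the definitive usage:

    # Sketch only: assumes the model loads as a standard sequence-classification
    # cross-encoder with a single logit where higher = more factually consistent.
    # See the HuggingFace model card for the exact interface.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "vectara/hallucination_evaluation_model"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    source = "The plant in the photo is a common houseplant."
    summary = "The photo shows a cannabis plant."

    inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    score = torch.sigmoid(logits[0]).item()  # assumption: single consistency logit
    print(f"consistency score: {score:.3f}")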



I worked on the model with our research team. It was recently featured in this NYT article (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...). I'm posting here to do an AMA. We are also looking for collaborators to help us maintain this model and make it the best it can be. Let us know if you want to help.


Hey, looks like your (very interesting) link got formatted incorrectly! Should be https://www.nytimes.com/2023/11/06/technology/chatbots-hallu..., right? :)


Yes thanks for fixing that.


Interesting work, thanks! Not enough people are studying this.

Do you have a whitepaper describing how you trained this hallucination detection model?

Is each row of the leaderboard the mean of the Vectara model's judgment of the 831 (article,summary) pairs, or was there any human rating involved? With so few pairs, it seems feasible that human ratings should be able to quantify how much hallucination is actually occurring.


We may write a research paper at some point. For now, see here: https://vectara.com/cut-the-bull-detecting-hallucinations-in...

Given the number of models involved, we have over 9k rows currently. Judging this task is quite time consuming, as you need to read a whole document and check it against a several-sentence summary, and some of the docs are a 1-3 minute read. We wanted to automate this process and also make it as objective as possible (even humans can miss hallucinations or disagree on an annotation). We also wanted people to be able to replicate the work, none of which is possible with a human rater. Others have attempted human rating, but on a much smaller scale; e.g. see Anyscale's post: https://www.anyscale.com/blog/llama-2-is-about-as-factually-... (but note that is under 1k examples).

We did some human validation, and the model aligns well with humans, though not perfectly, as it is a model after all. And again, humans don't agree 100% of the time on this task either.
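For anyone who wants to reproduce a scorecard row from per-pair scores, the aggregation is essentially just an average over the scored summaries. A rough sketch, where the file name, column names, and the 0.5 cutoff for "consistent" are illustrative assumptions rather than the actual leaderboard code:

    # Sketch: turn per-summary consistency scores into per-model rates.
    # "scored_summaries.csv", the column names, and the 0.5 cutoff are
    # placeholder assumptions, not the real leaderboard pipeline.
    import pandas as pd

    df = pd.read_csv("scored_summaries.csv")
    rows = (
        df.assign(consistent=df["score"] >= 0.5)
          .groupby("model")["consistent"]
          .agg(accuracy="mean", n="size")
          .assign(hallucination_rate=lambda d: 1 - d["accuracy"])
    )
    print(rows.sort_values("accuracy", ascending=False))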


I am the CEO and one of the cofounders of Vectara. We are very proud of the release of this open source eval model. We certainly would like to add more LLMs to the scorecard, and we would love to collaborate with others to make the evaluation model even more accurate. Please reach out to bader@ or simon@ if interested.


This link gives a lot more context for the less informed:

https://vectara.com/cut-the-bull-detecting-hallucinations-in...

I really want to know more about the hallucinations produced! Were some sources more likely to produce errors? I would be curious whether other document sets were explored for use with this test, and whether different source material would change the results in a meaningful way.


The original data we used was not annotated with sources, only with where the overall data came from; most was news articles. The length doesn't seem to matter too much, as we see a lot of errors even when summarizing a single sentence (sometimes the model felt compelled to elaborate with more info). Usually the hallucinations were common-sense inferences, such as assuming the plant was a cannabis plant in the example listed in the NYT article. Other times the LLM would invert things. E.g., if you ask any of the Google LLMs to summarize an article about a famous boxer, where the article stated that Wahlberg was a fan of said boxer, the PaLM models would flip it to say the boxer was a fan of Wahlberg's. Even the latest Bard model still does that; I tested it this weekend. It's a subtle and small error, but it's still factually incorrect.


You can view the responses here in the linked csv file: https://github.com/vectara/hallucination-leaderboard


Great work! Interesting to see Llama 2 7B is better than Llama 2 13B.


A smaller model has less capacity and is thus less prone to overfitting, which is one cause of hallucination. Overfitting means the model fails to generalize to inputs it didn't see during training.


Yes. Just because a model is smaller doesn't always mean it's worse by default, as models may be trained for less time or on less data, which in some cases could be beneficial. The differences are small, so they may not be statistically significant. Plus the model is doing the evaluation, so while it's highly correlated with humans, a small difference like this may not mean the 7B model is necessarily better.
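To make the significance point concrete, once per-summary judgments are published anyone can run a quick two-proportion z-test between two models. The counts below are made-up placeholders, not the leaderboard numbers:

    # Sketch: two-proportion z-test for whether a small gap in
    # "consistent summary" rates between two models is significant.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z(success_a, n_a, success_b, n_b):
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return z, p_value

    # Placeholder counts: 7B judged consistent on 785/831 vs 780/831 for 13B.
    z, p = two_proportion_z(785, 831, 780, 831)
    print(f"z = {z:.2f}, p = {p:.3f}")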


Nice work! Is there a plan to publish the methodology used? Is there a survey paper on how different architectures influence hallucination?


We're excited to collaborate with the industry and quantify hallucinations!



