Show HN: Open-source model and scorecard for measuring hallucinations in LLMs (vectara.com)
65 points by eskibars on Nov 6, 2023 | 14 comments
Hi all! This morning, we released a new Apache 2.0 licensed model on HuggingFace for detecting hallucinations in retrieval augmented generation (RAG) systems.

What we've found is that even when given a "simple" instruction like "summarize the following news article," every LLM that's available hallucinates to some extent, making up details that never existed in the source article -- and some of them quite a bit. As a RAG provider and proponents of ethical AI, we want to see LLMs get better at this. We've published an open source model, a blog post more thoroughly describing our methodology (with some specific examples of these summarization hallucinations), and a GitHub repository containing our evaluation of the most popular generative LLMs available today. Links to all of them are referenced in the blog post, but for the technical audience here, the most interesting additional links might be:

- https://huggingface.co/vectara/hallucination_evaluation_mode...

- https://github.com/vectara/hallucination-leaderboard

By releasing these under a truly open source license and detailing the methodology, we hope to make it viable for anyone to quantitatively measure and improve the generative LLMs they're publishing.
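If you want to poke at the model locally, here's a rough sketch of scoring a (source, summary) pair. The exact loading call and output interpretation are assumptions on my part; check the model card for the definitive usage:

    # Sketch only: assumes the model loads as a standard sequence-classification
    # cross-encoder with a single logit where higher = more factually consistent.
    # See the HuggingFace model card for the exact interface.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "vectara/hallucination_evaluation_model"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    source = "The plant in the photo is a common houseplant."
    summary = "The photo shows a cannabis plant."

    inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    score = torch.sigmoid(logits[0]).item()  # assumption: single consistency logit
    print(f"consistency score: {score:.3f}")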



I worked on the model with our research team. It was recently featured in this NYT article (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...). I'm posting here to do an AMA. We are also looking for collaborators to help us maintain this model and make it the best it can be. Let us know if you want to help.


Hey, looks like your (very interesting) link got formatted incorrectly! Should be https://www.nytimes.com/2023/11/06/technology/chatbots-hallu..., right? :)


Yes thanks for fixing that.


Interesting work, thanks! Not enough people are studying this.

Do you have a whitepaper describing how you trained this hallucination detection model?

Is each row of the leaderboard the mean of the Vectara model's judgment of the 831 (article,summary) pairs, or was there any human rating involved? With so few pairs, it seems feasible that human ratings should be able to quantify how much hallucination is actually occurring.


We may write a research paper at some point. For now, see here: https://vectara.com/cut-the-bull-detecting-hallucinations-in...

Given the number of models involved, we have over 9k rows currently. Judging this task is quite time consuming, as you need to read a whole document and check it against a several-sentence summary, and some of the docs are a 1-3 minute read. We wanted to automate this process and also make it as objective as possible (even humans can miss hallucinations or disagree on an annotation). We also wanted people to be able to replicate the work, none of which is possible with a human rater. Others have attempted human rating, but on a much smaller scale; e.g. see Anyscale's post: https://www.anyscale.com/blog/llama-2-is-about-as-factually-... (but note that is under 1k examples).

We did some human validation, and the model aligns well with humans, though not perfectly, as it is a model after all. And again, humans don't agree 100% of the time on this task either.
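For anyone who wants to reproduce a scorecard row from per-pair scores, the aggregation is essentially just an average over the scored summaries. A rough sketch, where the file name, column names, and the 0.5 cutoff for "consistent" are illustrative assumptions rather than the actual leaderboard code:

    # Sketch: turn per-summary consistency scores into per-model rates.
    # "scored_summaries.csv", the column names, and the 0.5 cutoff are
    # placeholder assumptions, not the real leaderboard pipeline.
    import pandas as pd

    df = pd.read_csv("scored_summaries.csv")
    rows = (
        df.assign(consistent=df["score"] >= 0.5)
          .groupby("model")["consistent"]
          .agg(accuracy="mean", n="size")
          .assign(hallucination_rate=lambda d: 1 - d["accuracy"])
    )
    print(rows.sort_values("accuracy", ascending=False))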


I am the CEO and one of the cofounders of Vectara. We are very proud of the release of this open source eval model. We certainly would like to add more LLMs to the scorecard, and we would love to collaborate with others to make the evaluation model even more accurate. Please reach out to bader@ or simon@ if interested.


This link gives a lot more context for the less informed:

https://vectara.com/cut-the-bull-detecting-hallucinations-in...

I really want to know more about the hallucinations produced! Were some sources more likely to produce errors? I would be curious whether other document sets were explored for use with this test, and whether different source material would change the results in a meaningful way.


The original data we used was not annotated with sources, only with where the overall data came from; most was news articles. The length doesn't seem to matter too much, as we see a lot of errors even when summarizing a single sentence (sometimes the model felt compelled to elaborate with more info). Usually the hallucinations were common-sense inferences, such as assuming the plant was a cannabis plant in the example listed in the NYT article. Other times the LLM would invert things. E.g., if you ask any of the Google LLMs to summarize an article about a famous boxer, where the article stated that Wahlberg was a fan of said boxer, the PaLM models would flip it to say the boxer was a fan of Wahlberg's. Even the latest Bard model still does that; I tested it this weekend. It's a subtle and small error, but it's still factually incorrect.


You can view the responses here in the linked csv file: https://github.com/vectara/hallucination-leaderboard


Great work! Interesting to see Llama 2 7B is better than Llama 2 13B.


A smaller model has less capacity and is thus less prone to overfitting, which is one cause of hallucination. Overfitting means the model fails to generalize to inputs it didn't see during training.


Yes. Just because a model is smaller doesn't always mean it's worse by default, as models may be trained for less time or on less data, which in some cases could be beneficial. The differences are small, so they may not be statistically significant. Plus the model is doing the evaluation, so while it's highly correlated with humans, a small difference like this may not mean the 7B model is necessarily better.
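To make the significance point concrete, once per-summary judgments are published anyone can run a quick two-proportion z-test between two models. The counts below are made-up placeholders, not the leaderboard numbers:

    # Sketch: two-proportion z-test for whether a small gap in
    # "consistent summary" rates between two models is significant.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z(success_a, n_a, success_b, n_b):
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return z, p_value

    # Placeholder counts: 7B judged consistent on 785/831 vs 780/831 for 13B.
    z, p = two_proportion_z(785, 831, 780, 831)
    print(f"z = {z:.2f}, p = {p:.3f}")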


Nice work! Is there a plan to publish the methodology used? Is there a survey paper on how different architectures influence hallucination?


We're excited to collaborate with the industry and quantify hallucinations!



