Long-form factuality in large language models (arxiv.org)
26 points by PaulHoule on April 6, 2024 | 16 comments


So using LLMs to fact-check LLMs with Google results? There are non-LLM fact-checkers out there [1] which aggregate search results from more diverse sources and which are shown to outperform LLMs [2].

[1] https://editor.factiverse.ai

[2] https://arxiv.org/abs/2402.12147


I skimmed the paper you linked, which, to be clear for other readers, is a write-up of the technology used in the first link. It looked to me like they use small LLMs to identify factual claims and rewrite them into search terms. The terms are then searched against an internal database or a search engine, and LLMs are used to roughly classify how true the sources think the claim is. It looks like a neat product, and trying out the link shows that the sources themselves are presented alongside the classification for the human to peruse, but it uses LLMs rather heavily. I certainly wouldn't classify it as a non-LLM fact-checker.
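
To make that concrete, here's a rough sketch of what such an extract-search-classify loop could look like. This is my own reading, not Factiverse's actual code or API: call_llm and web_search are hypothetical placeholders you'd swap for a real LLM client and search backend, and the prompts are illustrative only.

    # Hypothetical sketch of the claim -> search -> verdict pipeline described above.
    # call_llm() and web_search() are placeholders, not any vendor's real API.

    def call_llm(prompt: str) -> str:
        """Placeholder for a (small) LLM call; swap in a real client."""
        raise NotImplementedError

    def web_search(query: str, k: int = 5) -> list[str]:
        """Placeholder for a search-engine or internal-database lookup."""
        raise NotImplementedError

    def check_text(text: str) -> list[dict]:
        # 1. Use an LLM to pull out individual factual claims.
        claims = call_llm(f"List the factual claims in:\n{text}").splitlines()
        results = []
        for claim in claims:
            # 2. Rewrite the claim as a search query.
            query = call_llm(f"Rewrite as a search query: {claim}")
            # 3. Retrieve candidate sources.
            sources = web_search(query)
            # 4. Ask an LLM whether the sources support, refute, or don't cover the claim.
            verdict = call_llm(
                "Do these sources support the claim? Answer SUPPORTED, REFUTED, or UNCLEAR.\n"
                f"Claim: {claim}\nSources: {sources}"
            )
            # Keep the sources so a human can review them alongside the label.
            results.append({"claim": claim, "verdict": verdict, "sources": sources})
        return results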


That literally requires human annotators. If you can do even 75% as well as a human but at 1000x the scale, you still win using the LLMs. Not to mention, I personally think that on a simple task like "Does document A support response B?", GPT can probably beat most humans, unless the humans are going very slowly and have domain knowledge, depending on the case.


factiverse.ai states that GPT-4 is open source, that there is a colony on Mars, and is mixed on whether Finland is a real country.

Still needs work


Looks like a slight modification of FActScore [1], but instead of using Wikipedia as a verification source, they use Google Search. They also claim to include a wider range of topics. That said, FActScore allows you to use whatever knowledge source and topics you want [2].

[1]: https://arxiv.org/abs/2305.14251

[2]: https://github.com/shmsw25/FActScore?tab=readme-ov-file#to-u...
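
As a rough illustration of that last point, here is a minimal sketch of swapping in a custom knowledge source, based on my reading of the README section in [2]. The class and method names (FactScorer, register_knowledge_source, get_score) and their arguments are recalled from that README and may not match the current code exactly, so treat them as assumptions and check the repo.

    # Approximate sketch based on the FActScore README in [2]; names and
    # signatures here are assumptions -- verify against the repo before use.
    from factscore.factscorer import FactScorer

    fs = FactScorer(openai_key="api.key")

    # Register your own corpus (e.g. a .jsonl of title/text entries)
    # instead of the default Wikipedia dump.
    fs.register_knowledge_source(
        "my_corpus",
        data_path="my_corpus.jsonl",
        db_path="my_corpus.db",
    )

    # Score model generations about your own topics against that corpus.
    out = fs.get_score(
        topics=["some topic"],
        generations=["some model output about that topic"],
        knowledge_source="my_corpus",
    )
    print(out["score"])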


> At the same time, SAFE is more than 20 times cheaper than human annotators.

> LLMs have achieved superhuman performance on reasoning benchmarks (Bubeck et al., 2023; Gemini Team, 2023; Li et al., 2022) and higher factuality than humans on summarization tasks (Pu et al., 2023). The use of human annotators, however, is still prevalent in language-model research, which often creates a bottleneck due to human annotators' high cost and variance in both evaluation speed and quality. Because of this mismatch, we investigate how SAFE compares to human annotations and whether it can replace human raters in evaluating long-form factuality.

I am not an expert in this research, but this seems like a slippery slope, all in the pursuit of cost first and foremost.

Given that the summary focuses on cost and this paragraph mentions cost as the first point, it sure seems like these folks' only goal is to take humans out of the mix entirely when it comes to facts.

Is this a good idea? I am not sure.


If they can guarantee human-level or better performance, no problem at all (from a technical perspective; the social side is another matter).


I'd argue the issue is dataset drift. This works with Google _today_, but will it work with Google tomorrow? Will it work when Google changes its search algorithms and rankings? Will it work when AI-generated content overtakes human content on Google?

You can't trust that this will work on your knowledge domain or that it'll work in the future.


This seems a lot more like consensus checking than fact checking.


Consensus checking works for uncommon errors, where there are more ways to get it wrong than to get it right. It doesn’t work for errors where lots of people get it wrong the same way.


It’s funny that Google researchers use GPT-4 to generate the question bank.


GPT-4 is the SOTA, for better or worse, so it would be stupid not to use it merely for PR reasons.


Opus has been SOTA since about a week ago, but that's not the point: Google claimed their Ultra is better than GPT-4.


Using Google as an arbiter of truth? What could go wrong? The core concept of breaking down statements to check for factuality is otherwise sound.


I think it will likely work well for certain categories of topics, but stray very far from the truth for medical or nutritional ones.


Sounds like fraudulent pages will train this one



