The only way to check whether an LLM output is true is to do the work yourself (or have it done by a real person).
For tasks that are trivial to verify, that's fine: a compiler will run the code written by an LLM. Or ask an LLM to help you during the example mapping phase of BDD, and you'll quickly be able to tell what's good and what isn't.
But for the following tasks, there is a risk:
- ask an LLM to summarize an email you didn't read. You can't trust the result.
- you're a car mechanic. You dump your thoughts to a voice recorder and use AI to turn them into a structured text report. You'd better triple-check the output!
- you're a medical doctor attempting the same trick: you'd have to be extra careful with the result!
And don't count on software testing to make AI tools robust: LLMs are non-deterministic.
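A minimal sketch of why that is: a traditional assertion-based test assumes the same input yields the same output, which a stochastic model doesn't guarantee. Here `summarize` is a hypothetical stand-in for an LLM call, with the non-determinism simulated by random sampling.

```python
import random

def summarize(text: str) -> str:
    # Hypothetical stand-in for an LLM call: the same prompt
    # can produce different completions on each invocation.
    return random.choice(["summary A", "summary B"])

# Collect the outputs of repeated calls with an identical prompt.
results = {summarize("same prompt") for _ in range(50)}

# A test like `assert summarize(x) == expected` would pass on some
# runs and fail on others, because more than one output is possible.
print(sorted(results))
```

A test suite built on exact-match assertions against such a function is flaky by construction; that's the sense in which testing alone can't make the tool robust.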
"Microsoft Bing Copilot has falsely described a German journalist as a child molester, an escapee from a psychiatric institution, and a fraudster who preys on widows.
Martin Bernklau, who has served for years as a court reporter [...] asked Microsoft Bing Copilot about himself. He found that Microsoft's AI chatbot had blamed him for crimes he had covered."
That's why I set up my own thing.
I don't care about analytics at all, so I just wrote a simple build system to generate some very basic HTML redirects.
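Such a generator can be very small. This is a sketch under assumptions, not the author's actual build system: the slug, target URL, and `public` output directory are all made up for illustration. It emits a static page per redirect using a plain `<meta http-equiv="refresh">`, with no tracking involved.

```python
from pathlib import Path

# Hypothetical redirect map: old slug -> target URL.
REDIRECTS = {
    "old-post": "https://example.com/new-post",
}

# Instant meta-refresh plus a canonical link and a fallback anchor.
TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv="refresh" content="0; url={url}">
  <link rel="canonical" href="{url}">
</head>
<body><a href="{url}">This page has moved.</a></body>
</html>
"""

def build(out_dir: str = "public") -> None:
    # Write one index.html per slug so /old-post/ serves the redirect.
    for slug, url in REDIRECTS.items():
        dest = Path(out_dir) / slug / "index.html"
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(TEMPLATE.format(url=url))

build()
```

Each generated file is fully static, so any web server (or static host) can serve it with no analytics or JavaScript at all.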
That was unclear to me too, so I opened an issue (the very first of the project, yay!). The instructions have been updated; they might be a bit clearer now: direct contributions to the code are welcome.
John Casablancas, the father of the Strokes' lead singer, owned Elite Model Management, one of the world's largest model agencies, if not the largest at the time. Rumor among knowledgeable fans is that Julian used to ring up his father's agency to send "fans" to their early shows. [1]
This no doubt influenced venue owners' decisions to book them. Even if they were in on the ruse, who can afford to pass up the well-known "multiplier effect" that 20-30 models would have for your venue?