I am not sure if you are familiar with Pangram (co-founder here) but we are a gr...

nialse · 2025-11-29T18:10:58 1764439858

Nothing points out that the benchmark is invalid like a zero false positive rate. Seemingly it is pre-2020 text vs a few models rework of texts. I can see this model fall apart in many real world scenarios. Yes, LLMs use strange language if left to their own devices and this can surely be detected. 0% false positive rate under all circumstances? Implausible.

maxspero · 2025-11-29T18:21:28 1764440488

Our benchmarks of public datasets put our FPR roughly around 1 in 10,000. https://www.pangram.com/blog/all-about-false-positives-in-ai...

Find me a clean public dataset with no AI involvement and I will be happy to report Pangram's false positive rate on it.

godelski · 2025-11-30T22:59:06 1764543546

Max, there's two problems I see with your comment.

1) the paper didn't show a 0% FNR. I mean tables 4, 7, and B.2 are pretty explicit. It's not hard to figure out from the others either.

2) a 0% error rate requires some pretty serious assumptions to be true. For that type of result to not be incredibly suspect requires there to be zero noise in the data, analysis, and at all parts. I do not see that being true of the mentioned dataset.

Even high scores are suspect. Generalizing the previous a score is suspect if it is higher than the noise level. Can you truly attest that this condition is true?

I'm suspect that you're introducing data leakage. I haven't looked enough into your training and data to determine how that's happening but you'll probably need a pretty deep analysis as leakage is really easy to sneak in. It can do so in non obvious ways. A very common one is tuning hyper parameters on test results. You don't have to pass data to pass information. Another sly way for this to happen is that the test set isn't significantly disjoint from the training set. If the perturbation is too small then you aren't testing generalization you're testing a slightly noisy training set (which your training should be introducing noise to help regularize, so you end up just measuring your training performance).

Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.

Oarch · 2025-11-29T18:49:23 1764442163

I enjoyed this thoughtful write up. It's a vitally important area for good, transparent work to be done.

pinkmuffinere · 2025-11-29T19:26:11 1764444371

> Nothing points out that the benchmark is invalid like a zero false positive rate

You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.

nialse · 2025-11-30T09:56:00 1764496560

No one is punishing anyone. They just make an implausible claim. That is it.

bonsai_spool · 2025-11-29T19:06:59 1764443219

                          EditLens (Ours)
                   Predicted Label
                Human     Mix       AI
             ┌─────────┬─────────┬─────────┐
       Human │  1770   │   111   │    0    │
             ├─────────┼─────────┼─────────┤
 True  Mix   │   265   │  1945   │   28    │
 Label       ├─────────┼─────────┼─────────┤
         AI  │    0    │   186   │  1695   │
             └─────────┴─────────┴─────────┘

It looks like 5% of human texts from your paper are marked as mixed, and mixed texts are 5-10% if mixed texts as AI, from your paper.

I guess I don’t see that this is much better than what’s come before, using your own paper.

Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error

lifthrasiir · 2025-11-29T18:57:38 1764442658

It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", no matter AI detectors are reliable or not.

glenstein · 2025-11-29T19:12:09 1764443529

I would say it's important to hold off on the moralizing until after showing visible effort to reflect on the substance of the exchange, which in this case is about the fairness of asserting that the detection methodology employed in this particular case shares the flaws of familiar online AI checkers. That's an importantly substantive and rebuttable point and all the meaningful action in the conversation is embedded in those details.

In this case, several important distinctions are drawn, including being open about criteria, about such things as "perplexity" and "burstiness" as properties being tested for, and an explanation of why they incorrectly claim the Declaration of Independence is AI generated (it's ubiquitous). So it seems like a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.

rs186 · 2025-11-29T18:24:49 1764440689

The response would be more helpful if it directly addresses the arguments in posts from that search result.

maxspero · 2025-11-29T18:38:47 1764441527

There are dozens of first generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity based methods, which is a decent separators of AI and human text (80-90%) but has flaws that can't be overcome and high FPRs on ESL text.

https://www.pangram.com/blog/why-perplexity-and-burstiness-f...

Pangram is fundamentally different technology, it's a large deep learning based model that is trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.

QuadmasterXLII · 2025-11-29T19:35:58 1764444958

Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.

wordpad · 2025-11-29T20:22:59 1764447779

And you don't think a dozen of basically scams around the technology justify extreme scepticism?

QuadmasterXLII · 2025-11-29T20:33:35 1764448415

pixl97 · 2025-11-29T18:57:57 1764442677

GAN.. Just feed the output of your algorithms back into the LLM while learning. At the end of the day the problem is impossible, but we're not there yet.

anonymouskimmer · 2025-11-29T20:31:42 1764448302

Can your software detect which LLMs most likely generated a text?

maxspero · 2025-11-29T22:05:43 1764453943

Pangram is trained on this task as well to add additional signal during training, but it's only ~90% accurate so we don't show the prediction in public-facing results

ugh123 · 2025-11-29T22:38:28 1764455908

How do you discern between papers "completely fabricated" by AI vs. edited by AI for grammar?

ThrowawayTestr · 2025-11-29T20:46:44 1764449204

Are you concerned with your product being used to improve AI to be less detectable?

Majromax · 2025-11-29T21:16:15 1764450975

> Are you concerned with your product being used to improve AI to be less detectable?

The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.

Smaller providers might do this (again in the same way they now offer uncensored models), but they occupy a miniscule fraction of the market and will be a generation or two behind the leaders.

ThrowawayTestr · 2025-11-29T21:50:48 1764453048

They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.

maxspero · 2025-11-29T22:08:51 1764454131

It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.

ThrowawayTestr · 2025-11-29T22:17:22 1764454642

That sounds extremely naive but good luck!

jay_kyburz · 2025-11-29T18:37:38 1764441458

I thought the author was attempting to highlight the hypocrisy of using an AI to detect other uses of AI, as if one was a good use, and the other bad.

interleave · 2025-11-30T11:02:52 1764500572

Hi Max! Thank you for updating my mental model of AI detectors.

I was with total certainty under the impression that detecting AI-written text to be an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM and it'll never stop."

I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.

Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!

moffkalast · 2025-11-29T18:24:55 1764440695

I see the bullshit part continues on the PR side as well, not just in the product.