There are two possible explanations for this behavior: the model nerf is real, or there's a perceptual/psychological shift.

However, benchmarks exist. And I haven't seen any empirical evidence that the performance of a given model version grows worse over time on benchmarks (in general).

Therefore, some combination of two things is true:

1. The nerf is psychological, not actual.
2. The nerf is real, but in a way that is perceptible to humans and not to benchmarks.

#1 seems more plausible to me a priori, but if you aren't inclined to believe that, you should be positively intrigued by #2, since it points towards a powerful paradigm shift in how we think about the capabilities of LLMs in general... it would mean there is an "x-factor" that we're entirely unable to capture in any benchmark to date.



There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-....

The real issue is that the end user currently has no reliable way to detect changes in performance, other than being willing to burn the cash and run their own benchmarks regularly.

It feels to me like a perfect storm. The combination of high inference costs, extreme competition, and the statistical nature of LLMs makes it very tempting for a provider to tune their infrastructure to squeeze more volume from their hardware. I don't mean to imply bad-faith actors: things are moving at breakneck speed and people are trying anything that sticks. But the problem persists; people are building on systems that are in constant flux (for better or for worse).
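For what it's worth, "run your own benchmarks regularly" can start very small. A minimal sketch of a scheduled probe, assuming an OpenAI-compatible chat API and placeholder probe prompts and model id (all of which are assumptions for illustration, not anything a provider ships for this purpose):

  # pip install openai
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY (or base_url pointed at a compatible endpoint)

  # Fixed prompts with known-good answers; extend with prompts from your own workload.
  PROBES = [
      ("arithmetic", "What is 17 * 23? Reply with the number only.", "391"),
      ("capital", "What is the capital of France? One word only.", "Paris"),
  ]

  def probe(model: str) -> float:
      passed = 0
      for _name, prompt, expected in PROBES:
          resp = client.chat.completions.create(
              model=model,  # placeholder model id
              messages=[{"role": "user", "content": prompt}],
              temperature=0,
          )
          if expected in (resp.choices[0].message.content or ""):
              passed += 1
      return passed / len(PROBES)

  if __name__ == "__main__":
      # Run this from cron/CI on a schedule and log the score over time.
      print(probe("gpt-4o-mini"))

It's crude, but a time series of even a crude score is more than end users get from any provider today; what matters is the trend, not any single run.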


> There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-...

There was one well-documented case of performance degradation which arose from a stupid bug, not some secret cost cutting measure.


I never claimed that it was being done in secrecy. Here is another example: https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe....

I have seen openrouter mentioned multiple times here on HN: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Again, I'm not claiming malicious intent. But model performance depends on a number of factors and the end-user just sees benchmarks for a specific configuration. For me to have a high degree of confidence in a provider I would need to see open and continuous benchmarking of the end-user API.


All those are completely irrelevant. Quantization is just a cost optimization.

People are claiming that Anthropic et al. change the quality of the model after the initial release, which is entirely different and which the industry as a whole has denied. When a model is released under a certain version, the model doesn't change.

The only people who believe this are in the vibe coding community, believing that there’s some kind of big conspiracy, but any time you mention “but benchmarks show the performance stays consistent” you’re told you’re licking corporate ass.


I might be misunderstanding your point, but quantization can have a dramatic impact on the quality of the model's output.

For example, in diffusion, there are some models where a Q8 quant dramatically changes what you can achieve compared to fp16. (I'm thinking of the Wan video models.) The point I'm trying to make is that it's a noticeable model change, and can be make-or-break.
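As a toy illustration of why the precision choice isn't free (synthetic weights, not the Wan models specifically), here's a sketch comparing the rounding error of an fp16 round-trip against a naive per-tensor Q8 scheme:

  import numpy as np

  rng = np.random.default_rng(0)
  w = rng.normal(0, 0.02, size=100_000).astype(np.float32)  # toy "weights"

  # fp16: round-trip through half precision
  w_fp16 = w.astype(np.float16).astype(np.float32)

  # Q8: symmetric 8-bit quantization with a single per-tensor scale
  scale = np.abs(w).max() / 127.0
  w_q8 = np.round(w / scale).clip(-127, 127).astype(np.float32) * scale

  print("fp16 mean abs error:", np.abs(w - w_fp16).mean())
  print("q8   mean abs error:", np.abs(w - w_q8).mean())

Real Q8 schemes use per-channel or per-group scales and do much better than this naive version, but the per-weight error is still well above fp16's, and in diffusion that error gets applied at every denoising step.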


Of course, no one is debating that. What's being debated is whether this is done after a model's initial release, e.g. whether Anthropic will secretly change the new Opus model in a few weeks to perform worse but be more cost efficient.


> some secret cost cutting measure

That's not the point. It's just a day in the life of ops to tweak your system to improve resource utilization and performance, which can cause bugs you don't expect in LLMs. It's a lot easier to monitor performance in a deterministic system, and harder to see the true impact a change has on an LLM.


https://www.youtube.com/watch?v=DtePicx_kFY

"There's something still not quite right with the current technology. I think the phrase that's becoming popular is 'jagged intelligence'. The fact that you can ask an LLM something and they can solve literally a PhD level problem, and then in the next sentence they can say something so clearly, obviously wrong that it's jarring. And I think this is probably a reflection of something fundamentally wrong with the current architectures as amazing as they are."

Llion Jones, co-inventor of the transformer architecture


There is something not right with expecting that artificial intelligence will have the same characteristics as human intelligence. (I am responding to the quote.)


I think he's commenting more on the inconsistency of it, rather than the level of intelligence per se.


This. I keep telling people to stick to very specific questions with very specific limits and expectations, but no... give me 20 pages of PhD-level text that finds the cure for cancer.


The previous “nerf” was actually several bugs that dramatically decreased performance for weeks.

I do suspect continued fine-tuning lowers quality: the stuff they roll out for safety/jailbreak prevention. Those changes should in theory build up over time in their fine-tuning dataset, but each model will have its own flaws that need tuning out.

I do also suspect there's a bit of mental adjustment that goes on too.


I'm pretty sure this isn't happening with the API versions as much as with the "pro plan" (loss leader priced) routers. I imagine that there are others like me working on hard problems for long periods with the model setting pegged to high. Why wouldn't the companies throttle us?

It could even be that they just apply simple rate limits and that this degrades the effectiveness of the feedback loop between the person and the model. If I have to wait 20 minutes for GPT-5.1-codex-max medium to look at `git diff` and give a paltry and inaccurate summary (yes, this is where things are at for me right now, all this week), it's not going to be productive.


I run the same config, but it tends to fly through those commands on the weekends; the difference is very noticeable. I wouldn't be surprised if subscription users have a (much) lower priority.

That said I don’t go beyond 70% of my weekly limit so there’s that.


Or, 2b: the nerf is real, but benchmarks are gamed and models are trained to excel at them, yet fall flat in real world situations.


I mostly stay out of the LLM space but I thought it was an open secret already that the benchmarks are absolutely gamed.


As a personal anecdote, I had a fairly involved application that built up a context with a lot of custom prompting and created a ~1000 word output. I could run my application over and over again to inspect the results. It was fairly reproducible.

I was having really nice results with the o4-mini model with high thinking. A little while after GPT-5 came out I revisited my application and tried to continue. The o4-mini results were unusable, while the GPT-5 results were similar to what I had before. I'm not sure what happened to the model in the ~4-5 months since I set it down, but there was real degradation.


Is there a reason not to think that, when "refining" the models, they use the benchmarks as the measure, so the benchmarks show no fidelity loss while performance gets worse in unbenchmarked ways? "Once a measure becomes a target, it's no longer a useful measure."

That's case #2 for you but I think the explanation I've proposed is pretty likely.


The only time I've seen benchmark nerfing was a drop in performance between the 2.5 March preview and the release.


They are nerfed, and there is actually a very simple test that could prove otherwise: temperature 0. That is only allowed with the API, where you are billed full token prices.

Conclusion: It is nerfed unless Claude can prove otherwise.
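A sketch of what that test could look like against the API (the model id is a placeholder, and note that even at temperature 0 serving stacks don't guarantee bit-identical outputs, so what you'd track is whether the output distribution for a fixed prompt drifts over time):

  # pip install anthropic
  import anthropic
  from collections import Counter

  client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
  PROMPT = "List the first five prime numbers, comma separated."

  outputs = []
  for _ in range(10):
      msg = client.messages.create(
          model="claude-sonnet-4-5",  # placeholder model id
          max_tokens=64,
          temperature=0,
          messages=[{"role": "user", "content": PROMPT}],
      )
      outputs.append(msg.content[0].text.strip())

  # Rerun this daily: if the serving stack is unchanged, the distribution of
  # outputs for a fixed prompt should stay stable, even if not bit-identical.
  print(Counter(outputs))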


I don’t understand how you get from the first paragraph to the conclusion.


> 1. The nerf is psychological, not actual. 2. The nerf is real, but in a way that is perceptible to humans and not to benchmarks.

They could publish weekly benchmarks to disprove it. They almost certainly have internal benchmarking.

The shift is certainly real. It might not be model performance but contextual changes or token performance (tasks take longer even if the model stays the same).


Anyone can publish weekly benchmarks. If you think Anthropic is lying about not nerfing their models, you shouldn't trust benchmarks they release anyway.


I never said they were lying. They haven’t stated that they do not tweak compute, and we know the app is updated regularly.


Moving to new hardware + caching + optimizations might actually change the output slightly; it'll still pass evals all the same, but on the edges it just "feels weird", and that's what makes it feel like it's nerfed.


> The nerf is psychological, not actual

I tested this once: I gave the same task to a model right after its release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about the jobs of software engineers. The second attempt was a nightmare, like a butcher acting as a junior developer performing surgery on a horse.

Is this empirical evidence?

And this is not only my experience.

Calling this psychological is gaslighting.


> Is this empirical evidence?

Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.

But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.


It's not non-empirical. He was careful to run the same experiment twice. The dependent variable is his judgment, sure, but why shouldn't we trust that if he's an experienced SWE?


Sample size is way too small.

Unless he was able to sample with temperature 0 (and get fully deterministic results both times), this can just be random chance. And experience as SWE doesn't imply experience with statistics and experiment design.
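To put a number on how easily chance alone produces an "it worked, then it didn't" story, here's a small sketch with made-up numbers (a constant per-attempt success rate, a few attempts per session, no change to the model):

  import numpy as np

  rng = np.random.default_rng(0)
  p, n = 0.5, 3  # constant per-attempt success rate, attempts per "session"
  sessions = rng.binomial(n, p, size=(100_000, 2))

  # How often does an unchanged model look clearly better in session 1 than in session 2?
  apparent_nerf = (sessions[:, 0] - sessions[:, 1] >= 2).mean()
  print(f"{apparent_nerf:.1%}")  # roughly 11% with these numbers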


> But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.

Well, if we look at it this way, this is true for Anthropic's benchmarks as well.

Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”

So what I described is the exact definition of empirical.


No, it's entirely psychological.

Users are not reliable model evaluators. It's a lesson the industry will, I'm afraid, have to learn and relearn over and over again.


I don't really find this a helpful line to traverse. By this line of inquiry most of the things in software are psychological.

Whether something is a bug or feature.

Whether the right thing was built.

Whether the thing is behaving correctly in general.

Whether it's better at the very moment that the thing occasionally works for a whole range of stuff or that it works perfectly for a small subset.

Whether fast results are more important than absolutely correct results for a given context.

Yes, all things above are also related with each other.

The most we have for LLMs is tallying up each user's experience using an LLM for a period of time across a wide range of "compelling" use cases (the pairing of their prompts and results is empirical though, right?).

This should be no surprise, as humans often can't agree on an end-all-be-all intelligence test for humans either.


No. I'm saying that if you take the same exact LLM on the same exact set of hardware and serve it to the same exact humans, a sizeable number of them will still complain about "model nerfs".

Why? Because humans suck.


Giving the same prompt and getting totally different results is not user evaluation, nor is it psychological. As a developer, you cannot tell the customer you are working for: hey, the first time it did what you asked, the second time it ruined everything, but look, here is the benchmark from Anthropic, and according to it there is nothing wrong.

The only thing that matters and that can evaluate performance is the end result.

But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models any time. Why don't they do it?


The models are non-deterministic. You can't just assume that because it did better before, it was on average better than before. And the variance is quite large.


No one talked about determinism. The first time it was able to do the task, the second time it was not. It's not that the implementation details changed.


This isn't how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to succeed (see also the 50% time horizon metric by METR).
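A sketch of what "run the same task n times" yields, with a rough interval to show why one or two runs tell you little (the run counts here are invented):

  import math

  def success_rate(results: list[bool]) -> tuple[float, float]:
      """Observed success rate with a rough 95% normal-approximation half-width."""
      n = len(results)
      p = sum(results) / n
      return p, 1.96 * math.sqrt(p * (1 - p) / n)

  # e.g. the same task run 20 times against the same model version
  runs = [True] * 13 + [False] * 7
  p, hw = success_rate(runs)
  print(f"success rate ~ {p:.2f} +/- {hw:.2f}")  # 0.65 +/- 0.21: still wide at n = 20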


I did not say that I only ran the prompt once per attempt. When I say it failed the second time, I mean that I spent hours restarting, clearing context, giving hints, doing everything to help the model produce something that works.


You are really talking past others' points. Get a friend of yours to read what you are saying; it doesn't sound scientific in the slightest.


I never claimed this was a scientific study. It was an observation repeated over time. That is empirical in the plain meaning of the word.

Criticizing it for "not being scientific" is irrelevant; I didn't present it as science. Are people only allowed to share experiences here if they come wrapped in a peer-reviewed paper?

If you want to debate the substance of the observation, happy to. But don’t rewrite what I said into a claim I never made.


I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long-duration tasks; it uses the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs, where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.

  On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

  For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
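For concreteness, the fit they describe can be sketched like this (the task data below is invented purely to show the mechanics of the logistic fit and the 50% horizon read-off):

  # pip install numpy scikit-learn
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Invented data: human completion time per task (minutes) and whether the model solved it.
  human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
  model_solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

  X = np.log(human_minutes).reshape(-1, 1)  # success is modelled against log task length
  clf = LogisticRegression().fit(X, model_solved)

  # 50% time horizon: the human task length where predicted success crosses 0.5,
  # i.e. where the logistic's linear term b0 + b1 * log(t) equals zero.
  b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
  print(f"50% time horizon ~ {np.exp(-b0 / b1):.0f} human-minutes")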


It makes perfect sense to use human times as a baseline, because otherwise the test would be biased towards models with slower inference.

If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal.


But it doesn't evaluate the area in which I am most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, problem solving, and handling ambiguity over a long duration without a human steering the agent back in the right direction before it hits a wall or starts causing damage.

If it takes me 8 hours to create a pleasant-looking to-do app and Gemini 3 can one-shot that in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc. for 30 min to 1 hr without going off the rails.

It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.

Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back and forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.


There are many, many tasks that a given LLM can successfully do 5% of the time.

Feeling lucky?


I'm working on a hard problem recently and have been keeping my "model" setting pegged to "high".

Why in the world, if I'm paying the loss leader price for "unlimited" usage of these models, would any of these companies literally respect my preference to have unfettered access to the most expensive inference?

Especially when one of the hallmark features of GPT-5 was a fancy router system that decides automatically when to use more/less inference resources, I'm very wary of those `/model` settings.


Because intentionally fucking over their customers would be an impossible secret to keep, and when it inevitably leaked it would trigger severe backlash, if not investigations for fraud. The game-theoretic model you're positing only really makes sense if there's only one iteration of the game, which isn't the case.


That is unfortunately not true. It's pretty easy to mess with your customers when your whole product is as opaque as an LLM. I mean, the providers don't even understand how the models work internally.


https://en.wikipedia.org/wiki/Regression_toward_the_mean

The way this works is:

1) x% of users have an exceptional first experience by chance. Nobody who has a meh first experience bothers to try a second time.
2) x²% of users also have an exceptional second experience by chance.
3) So a lot of people with a great first experience think the model started off great and got suddenly worse.

Suppose it's 25% that have a really great first experience. 25% of them have a great second experience too, but 75% of them see a sudden decline in quality and decide that it must be intentional. After the third experience this population gets bigger again.

So by pure chance and sampling bias you end up with a bunch of people convinced that the model used to be great but has gotten worse, and only a much smaller population who thought it was terrible but got better, because most of that second group gave up early.

This is not in their heads: they really did see declining success. But they experienced it without any changes to the model at all.
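The selection effect is easy to simulate; a sketch using the 25% figure from above (the model in the simulation literally never changes):

  import numpy as np

  rng = np.random.default_rng(0)
  n_users, p_great = 1_000_000, 0.25  # chance that any single session feels "great"

  first = rng.random(n_users) < p_great
  second = rng.random(n_users) < p_great  # same unchanged model, same odds

  returners = first  # only people with a great first session bother to come back
  thinks_nerfed = returners & ~second
  print(thinks_nerfed.sum() / returners.sum())  # ~0.75 of returners report "it got worse"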


Your theory does not hold if a user initially had a great experience for weeks and then a bad experience, also for weeks.


If by "second" and "third" experience you mean "after 2 ~ 4 weeks of all-day usage"


I think this is pretty easy to explain psychologically.

The first time you see a dog that can make pancakes, you’re really focused on the fact that a dog is making pancakes.

After a few weeks of having them for breakfast, you start to notice that the pancakes are actually kind of overcooked and don’t taste that good. Sure it’s impressive that a dog made them, but what use are sub-par pancakes? You’re naturally more focused on what it can’t do than what it can.


I'm not doubting you, but share the chats! It would make your point even stronger.



