augment_me's comments | Hacker News

I am with you on this, and you can't win, because as soon as you voice this opinion you get overwhelmed with "you don't have the sauce/prompt" replies, which rest on an inherent fallacy because they assume you are solving the same problems as they are.

I work in GPU programming, so there is no way in hell that JavaScript tooling and database-wrapper tasks can be on equal terms with generating, for example, Blackwell tcgen05 warp-scheduled kernels.


There's going to be a long tail of domain-specific tasks that aren't well served by current models for the foreseeable future, but there's also no question the complexity horizon of the SotA models is increasing over time. I've had decent results recently with non-trivial CUDA/MPS code. Is it great code, finely tuned? Probably not, but it delivered on the spec and runs fast enough.

> you can't win, because as soon as you voice this opinion you get overwhelmed with "you don't have the sauce/prompt"

> I've had decent results recently with non-trivial CUDA/MPS code.


In my experience LLMs are useless for GPU compute code; there's just not enough of it in the training set.

Yeah, and the argument here is that once you say this, people will say "you just don't know how to prompt, I pass the PTX docs together with Nsight output and my kernel into my agent, run an evaluation harness, and beat cuBLAS". And then it turns out that they are writing a GEMM on Ampere/Hopper, which is an in-distribution problem for the LLMs.

It's the idea/mindset that, because you are working on something the tool covers well in its distribution, it must be a skill issue or mindset problem for everyone else who is not getting value from the tool.


Now please get back to coding GPU stuff so we can train our models on your code. Thank you.

Another thing I've never gotten them to generate is any G-code. Maybe that'll come indirectly from the image/3D generation side, but I was kind of hoping I could generate some motions, since hand-coding coordinates is very tedious. That would be a productivity boost for me. A very, very niche boost, since I rarely need bespoke G-code, but still.

Oh HELL no. :P G-code is (at least if you're talking about machining) the very definition of something you want to generate analytically, using tried and tested algorithms, with full consideration taken for the specifics of the machine and material involved.

I guess if you just want to use it to wiggle something around using a stepper motor and a spare 3D printer control board, it might be OK though. :)
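For what it's worth, "generate analytically" can be as simple as a short script that emits the moves for you, so nobody has to hand-code coordinates. A toy Python sketch, with made-up parameters, strictly hobby-grade and not machining advice:

    import math

    def arc_moves(radius_mm=20.0, steps=36, feed=600):
        # Emit G1 linear moves approximating a circle in the XY plane.
        lines = ["G21 ; millimetres", "G90 ; absolute positioning"]
        for i in range(steps + 1):
            a = 2 * math.pi * i / steps
            lines.append(f"G1 X{radius_mm * math.cos(a):.3f} Y{radius_mm * math.sin(a):.3f} F{feed}")
        return "\n".join(lines)

    print(arc_moves())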


Anthropic has a challenge for optimizing GPU code.

The current leader is Opus 4.5

https://github.com/anthropics/original_performance_takehome


You did something smart and efficient, using the least amount of energy and time needed. +1 for consciousness being a mistake.

Lmao using amendments as arguments in 2026

How many years has it been since there were any amendments? Representative government is dead.

I don't think that most research starts with the idea of being a crypto rug pull. Many research labs and startups fail, and that is fine; you don't have to double down and drag a bunch of people into the mud with you because of it, which is what a lot of the examples the author points to did.

In some sense I just feel like this is another way to gamble, which in general is seeing unprecedented growth with Polymarket and the like. There is less faith in white-collar skills making you rich, so you just try your luck.


This is the first question I ask, and every time I get the answer of some monolith that supposedly solves something. Imo, this is completely fine for any personal thing; I am happy when someone says they made an API to compare weekly shopping prices from the stores around them, or some recipe tool. That makes sense.

More often than not, however, someone is just building a monolithic construction that will never be looked at again. For example, someone found that the HuggingFace dataloader was slow for some combination of file size and disk. What does that warrant? A 300,000+ line unreviewed repo to fix the issue. Not a 200-line PR to HuggingFace; no, you need to generate 20% of the existing repo and then slap your thing on top.

For me this is puzzling: what is this for? Who is this for? People used to build these things for practice, but now it's generated, so it's not for practice, because you put very little effort into it. The only purpose I can see is some kind of competence signaling, but here again, if the engineer/manager looking at it knows it is generated, it does not carry the value such signaling would otherwise have. Either I am naive and people still look at these repos and go "whoa, this is amazing", or it's some kind of induced ego trip/delusion where the LLM has convinced you that you are the best builder.


I noticed that, despite really liking Karpathy and the blog, I am kind of wincing at/involuntarily reacting to the LLM-like "It's not X, it's Y" phrases:

> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer

> it's not just about the image generation itself, it's about the joint capability coming from text generation

There would have been no reaction from me to this 3 years ago, but now this sentence structure is ruined for me.


I used to use a lot of em dashes normally in my writing - they were my go-to replacements for commas and semicolons

But I had to change how I write because people started calling my writing “AI generated”


2026 will be the year of the ;


Please no, that's my go-to.


so you switched to using hyphens instead?


En dashes!


You’re absolutely right!

Jk jk, now that you pointed it out I can’t unsee it.


Yeah, came to read Karpathy's thoughts, but might as well ask an LLM myself..


Very broadly, AI sentence-structure and word choice is recursing back into society, changing how humans use language. The Economist recently had a piece on word usage of British Parliament members. They are adopting words and phrases commonly seen in AI.

We're embarking on a ginormous planetary experiment here.


> The Economist recently had a piece on word usage of British Parliament members. They are adopting words and phrases commonly seen in AI.

Many of the speeches given by MPs are likely to have been written beforehand, in whole or in part. Wouldn’t the more likely explanation be that they, or their staff, are using LLMs to write their speeches?


I hated these sentences way before LLMs, at least in the context of an explanation.

> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer

This type of sentence is what I call rhetorical fat. Remove the fat and you're left with a boring sentence that repeats what was said in the previous one.

Not all rhetorical fat is equal, and I must admit the "little spirit" part makes me roll my eyes more than the fatness itself does.

I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals onto a text that doesn't belong to me.

> it's not just about the image generation itself, it's about the joint capability coming from text generation.

That's unjustified conceptual stress.

That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but this is written text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of hype without the hype.

"I find image generation is cooler when paired with text generation."


It is not decoration. Karpathy juxtaposes ChatGPT (which feels like a "better Google" to most people) with Claude Code, which, apparently, feels different to him. It's a comparison between the two.

You might find this statement non-informative, but without the two parts there's no comparison. That's really the semantics of the statement Karpathy is trying to express.

ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something reader considers trite. But it's not the case here.


Indeed, I was probably grumpy at the time I wrote the comment. I still find some truth in it, though.

You're right! The strawman theory is based.

But I think there's more to it: I dislike the structure of these sentences, which I find a bit sensationalist for nothing (I don't know, maybe I am still grumpy).


Well, language is subject to a 'fashion' one-upmanship game: people want to demonstrate their sophistication, often by copying some "cool" patterns, but then over-used patterns become "uncool" clichés.

So it might just be a natural reaction to the over-use of a particular pattern; this kind of thing has been driving language evolution for millennia. Besides that, a pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.


Karpathy should go back to what he does best: educating people about AI on a deep level, running experiments and sharing how they work, that sort of thing. Lately he seems closer to an influencer who reviews AI-based products. Hopefully it's not too late to go back.


I feel this review stuff is more of a side thing / pastime for him. Look at nanochat, for example. My impression is that those are the things he still spends most of his energy on.

After all, he's been an "influencer" for a long time, starting with the "Software 2.0" essay.


We need to integrate how Singapore and Japan do oral English into our writing I guess.

Joking aside, as a non-native English speaker who spent quite a bit of time learning to write English "properly", this trend of needing to write baad Engrish to avoid being publicly called out for "written by an LLM" is frustrating...


I cannot unsee this anymore and it ruins the whole internet experience for me


Same here; I had to configure ChatGPT to stop making these statements. Also had to configure a bunch of other stuff to make it bland when answering questions.


The way to make AI not sound like ChatGPT is to use Claude.

I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."

It'll still sound like AI, but 90% of the cringe is gone.

If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.

That being said, I feel very self-conscious using em dashes this decade ;)


If a reader gets angry simply because the author used ChatGPT instead of Claude, then the reader is an idiot.


I don't think I've ever noticed someone use an em dash until ChatGPT appeared.


That's because people didn't make a point of performatively noticing them. But e.g. macOS and iOS have been auto-inserting them for a long time now. Ditto Word.


I didn't know this, thanks.


https://xkcd.com/3126/

I mostly use them in Telegram because it auto-converts -- into an em dash. They are a pain to type everywhere else, though!


I love em dashes—they basically indicate a more deliberate pause than a … without the tight vibes of a semicolon.


I don't use LLMs for writing, just for factual research stuff, and this would happen even on those questions.


Same, I cringe when I read this structure.


It's not text - it's clickbait distilled to grammar.


Not all software is meant to be open source, in production, and working on 100 platforms.

Sometimes the point of the software is to make an app with 2 buttons for your mom, to make her grocery shopping easier.


The text structure screams GPT-5, sadly, so I would not be surprised if not only the text but also the images were wrong.


Your "research" is a vibe-coded mess that subtly cheats eval cleverly multiple times to inflate your results.

HellaSwag is a dataset with 4 options for each question, 3 wrong and 1 right: https://huggingface.co/datasets/Rowan/hellaswag.

Your vibe-coded eval collapses this into a binary selection on row 46 in https://github.com/Anima-Core/an1-core/blob/main/experiments..., making the random-choice baseline 50% instead of 25% and the problem much easier. HellaSwag is specifically constructed with adversarial distractors that look plausible; by not including them, the eval is much easier.
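For reference, a simplified sketch of the standard 4-way protocol (illustrative code, not the repo's; it assumes a HuggingFace causal LM and the Rowan/hellaswag fields ctx, endings, and label):

    import torch

    @torch.no_grad()
    def hellaswag_4way_accuracy(model, tokenizer, examples):
        # Score each of the four endings by its log-likelihood given the
        # context and take the argmax; the random baseline is 25%, not 50%.
        correct = 0
        for ex in examples:
            ctx_len = tokenizer(ex["ctx"], return_tensors="pt").input_ids.shape[1]
            scores = []
            for ending in ex["endings"]:                  # 4 candidate endings
                ids = tokenizer(ex["ctx"] + " " + ending, return_tensors="pt").input_ids
                labels = ids.clone()
                labels[:, :ctx_len] = -100                # only score the ending tokens
                scores.append(-model(ids, labels=labels).loss.item())
            correct += int(max(range(4), key=scores.__getitem__) == int(ex["label"]))
        return correct / len(examples)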

---

Then, in extract_fields_from_model, there is another bit of cheating. The extraction logic (h[:, -1, :]) fails to account for padding in batches, likely extracting EOS/pad tokens instead of the intended content tokens. This suggests the probe is relying on global sentence summaries (standard last-token embeddings in causal LMs) rather than the novel 'meaning fields' claimed in the paper.
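A padding-aware version would look something like this (a minimal sketch with illustrative names, assuming right-padded batches and a standard attention_mask; not the repo's actual code):

    import torch

    def last_content_token_states(hidden, attention_mask):
        # hidden: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token.
        # Naive hidden[:, -1, :] grabs a pad/EOS slot on right-padded batches;
        # instead, index the last real token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        return hidden[batch_idx, last_idx, :]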

---

I don't have time to look at more of this, and I only looked at how the eval is made, but please don't waste people's time when you don't even know what you are evaluating.


I guess my "vibe" is just better than your coding :)... Let me explain a few things, if you will. A few clarifications so the discussion stays aligned with what the experiment is actually measuring.

1. The HellaSwag “binary collapse” is intentional and not a leaderboard claim. This work doesn’t attempt to benchmark HellaSwag in the standard four-choice setting. The goal is to probe whether a single frozen layer carries enough information for a small head to distinguish correct versus incorrect continuations. That's a representational geometry test, not a SOTA claim. Binary framing raises the baseline, but that's expected and documented. It's not meant to compare against full LLM HellaSwag results.

2. No adversarial filtering was done. I am using HuggingFace’s standard split directly. Nothing was removed or curated. The experiment doesn't claim robustness or benchmark competitiveness, so the “easier eval” framing doesn’t really apply.

3. EOS extraction isn't cheating, it's the whole point of the probe. The extraction logic takes the final token’s hidden state, which is basic and standard for classification heads and probing studies. If the EOS token captures a high-level sequence summary, that's exactly the structural feature being examined. The result is meant to show how much task-relevant signal is already present in that early representation, not to present a new generative mechanism.

4. The purpose of the work is clearly narrow by design. This is not proposed as a drop-in replacement for full-transformer inference. The paper states that directly. The contribution is about how much structure a single early layer encodes and how far a tiny head can go under strict frozen-teacher constraints. So several of the criticisms make assumptions about goals the work never even claimed.

Thank you for the feedback and for taking the time.


I don't know if you are trying to delude yourself or someone else with your motte-and-bailey fallacy (https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy), but it doesn't work when you are literally advertising 4 classes for HellaSwag on the website for the product:

https://www.animacore.ai/

As well as literally writing out "CUDA-compatible drop-in".

Look at your post being flagged and think for yourself about what you are actually doing. It seems to be some kind of LLM-induced psychosis; here is a good read that could ground you: https://www.lesswrong.com/posts/rarcxjGp47dcHftCP/your-llm-a...


I can see you've put real thought into your critique, and while I definitely disagree with several conclusions, I appreciate the seriousness of the discussion. Hopefully this is a good faith discussion, and we can keep it that way.

Let me start with the Motte-and-Bailey point, since that seems to be the crux of your argument.

For anyone unfamiliar, a motte-and-bailey fallacy is when someone makes a bold or controversial claim, then retreats to a weaker, safer claim under pressure while pretending the two were always the same. That's simply not what's happening here in the slightest.

The confusion begins with a misreading of the title, which, in hindsight, I agree should have been clearer, so that the work rather than the semantics was being critiqued. (Although the paper is clear on this distinction.)

"Post-Transformer Inference" does not mean no transformer, nor does it mean replacing transformers. It refers to where inference is performed in the pipeline. The transformer remains fully intact and unchanged; it's used exactly as intended, to extract representations. The contribution begins after that point.

The paper is explicit about this throughout:

The transformer is fully used and not replaced.

The compressed heads are task-specific and not general LLM substitutes.

The 224× compression applies to task-specific inference paths, NOT to the base model weights.

There's no shift in scope, no retreat, and no weaker fallback claim. The boundary is fixed and stated clearly.

On HellaSwag and the “4 classes” point, this is simply a category error. HellaSwag is a four-choice benchmark by definition. Advertising four classes describes the label space of the task, not the capacity of the model. Compression here refers to internal representations and compute required for inference, not to the number of output labels. Those are different layers of the system.

The same applies to “CUDA-compatible drop-in.” That phrase refers to integration, not equivalence. It means this work can plug into existing CUDA-based pipelines without requiring teams to rewrite or replace their infrastructure. It absolutely does not claim semantic equivalence to CUDA kernels, nor does it claim GPU replacement. The goal is to extract value without forcing anyone to rebuild their stack. That distinction is intentional and explicit.

You also cited the LessWrong essay, which I'm very familiar with and broadly agree with in spirit. It's a valid warning about vague, unfalsifiable, or scope-shifting claims in LLM-assisted research. That critique applies when claims move or evidence is absent. Here, the claims are narrow, fixed, and empirically evaluated, with code and benchmarks available. Disagree with the results if you want, but that essay just isn't describing this situation at all.

As for the flagging. That's easy. There's nothing mysterious about it. Work that challenges familiar abstractions often gets flagged first for language, not for results. Titles that suggest a different inference boundary tend to trigger skepticism before the experiments are actually read. That doesn't mean the work isn't correct, and it would be wrong to assume that.

Flagging isn't peer review. Real critique points to broken assumptions, flawed metrics, or reproducibility failures.

Again, I will freely admit the title was designed to be punchy, and while it's technically accurate, I can see now how it invites semantic confusion. That is totally fair feedback, and I will refine that framing going forward. That doesn't make the results wrong, nor does it make this a motte-and-bailey.

If you want to talk about the data, the methodology, or where this work is heading next, I'm more than happy to do that. I suspect some of the disagreement here is less about intent and more about where you think the boundary of the system is. Once that clicks, the rest tends to fall into place.


I think the paper in general completely oversells the idea of "universality".

For CNNs, the 'universal subspace' is simply the strong inductive bias (locality) forcing filters into standard signal-processing shapes (Laplacian/Gabor), regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.

For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.


For me at least, I wasn't even under the impression that this was a possible research angle to begin with. Crazy stuff that people are trying, and very cool too!

