Hacker News | jaccola's comments

I would hate to be one of the ~80 million people in the world who have identical faces

All of the latest models I've tried actually pass this test. What I found interesting was that all of the success cases were similar to:

e.g. "Drive. Most car washes require the car to be present to wash,..."

Only most?!

They seem unable to hold a strong "opinion", probably because their post-training, and maybe the internet in general, prefer hedged answers....


Here’s my take: boldness requires the risk of being wrong sometimes. If we decide being wrong is very bad (which I think we generally have agreed is the case for AIs) then we are discouraging strong opinions. We can’t have it both ways.

Last year's models were bolder. E.g. Sonnet 3.7 (thinking) got it right without hedging in all 10 attempts:

>You should drive your car to the car wash. Even though it's only 50 meters away (which is very close), you'll need your car physically present at the car wash to get it washed. If you walk there, you'll arrive without your car, which wouldn't accomplish your goal of getting it washed.

>You'll need to drive your car to the car wash. While 50 meters is a very short distance (just a minute's walk), you need your car to actually be at the car wash to get it washed. Walking there without your car wouldn't accomplish your goal!

etc. The reasoning never second-guesses it either.

A shame they're turning it off in 2 days.




You know what they mean by opinions. Policing speech like this is always counterproductive.

Yet the LLMs seem to be extremely bold when they are completely wrong (two Rs in strawberry and so on).

> They have an inability to have a strong "opinion" probably

What opinion? Its evaluation function simply returned the word "Most" as being the most likely first word in similar sentences it was trained on. It's a perfect example showing how dangerous this tech could be in a scenario where the prompter is less competent in the domain they are looking for an answer in. Let's not do the work of filling in the gaps for the snake oil salesmen of the "AI" industry by trying to explain its inherent weaknesses.


Presumably the OP scare quoted "opinion" precisely to avoid having to get into this tedious discussion.

this example worked in 2021, it's 2026. wake up. these models are not just "finding the most likely next word based on what they've seen on the internet".

Well, yes, definitionally they are doing exactly that.

It just turns out that there's quite a bit of knowledge and understanding baked into the relationships of words to one another.

LLMs are heavily influenced by preceding words. It's very hard for them to backtrack on an earlier branch. This is why all the reasoning models use "stop phrases" like "wait", "however", "hold on..." It's literally just text injected to make the autocomplete more likely to revise previous bad branches.
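
A minimal sketch of that trick, assuming a hypothetical generate(prompt) -> text function and a "</think>" end-of-reasoning marker (both are assumptions, not any particular vendor's API):

    # Toy sketch, not a real vendor API: generate() and the "</think>" marker
    # are hypothetical. The point is only to show how injecting a phrase like
    # "Wait," where the model tried to stop makes the continuation more likely
    # to revisit an earlier bad branch.
    def think_longer(generate, prompt, nudges=2):
        out = generate(prompt)
        for _ in range(nudges):
            if "</think>" not in out:
                break  # the model kept reasoning on its own; nothing to nudge
            # replace the stop marker with a revision phrase and keep completing
            out = out.split("</think>")[0] + "Wait,"
            out = out + generate(prompt + out)
        return out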


The person above was being a bit pedantic, and zealous in their anti-anthropomorphism.

But they are literally predicting the next token. They do nothing else.

Also, if you think they were just predicting the next token in 2021, there has been no fundamental architecture change since then. All gains have come from scale and efficiency optimisations (not to discount that; there's an awful lot of complexity in both).


That's not what they said. They said:

> It's evaluation function simply returned the word "Most" as being the most likely first word in similar sentences it was trained on.

Which is false under any reasonable interpretation. They do not just return the word most similar to what they would find in their training data. They apply reasoning and can choose words that are totally unlike anything in their training data.

If you prompt it:

> Complete this sentence in an unexpected way: Mary had a little...

It won't say lamb. And if you think whatever it says was in the training data, just change the constraints until you're confident it's original. (E.g. tell it every word must start with a vowel and it should mention almonds.)

"Predicting the next token" is also true but misleading. It's predicting tokens in the same sense that your brain is just minimizing prediction error under predictive coding theory.


You are actually proving my point with your example, if you think about it a bit more.

If there is no response it could give that will disprove your point, then your belief is unfalsifiable and your point is meaningless.

Huh?

Were you talking about the "Mary had a little..." example? If not, I have no idea what you're trying to say.

Unless LLM architectures have changed, that is exactly what they are doing. You might need to learn more about how LLMs work.

Unless the LLM is a base model or just a finetuned base model, it definitely doesn't predict words just based on how likely they are in similar sentences it was trained on. Reinforcement learning is a thing and all models nowadays are extensively trained with it.

If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to give a final higher reward.
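
A toy illustration of that ensemble idea (the numbers and the single decoding step are completely made up, not any real model's parameters): once a reward term reshapes the base distribution, the sampled word need not be the one most likely in the training text.

    # Purely illustrative numbers -- not how any production model is actually
    # parameterised. It just shows that the chosen token is not simply the
    # corpus-likeliest one once a learned reward signal is added.
    import math

    base_logprob = {"Most": -0.3, "Drive": -1.4, "Walk": -2.0}  # hypothetical
    reward_bonus = {"Most": 0.0, "Drive": 1.8, "Walk": -0.5}    # hypothetical
    beta = 1.0  # how strongly the reward reshapes the base distribution

    score = {t: base_logprob[t] + beta * reward_bonus[t] for t in base_logprob}
    z = math.log(sum(math.exp(s) for s in score.values()))
    policy = {t: round(math.exp(s - z), 3) for t, s in score.items()}
    print(policy)  # "Drive" now outranks "Most" despite lower base likelihood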


> If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to give a final higher reward.

So... "finding the most likely next word based on what they've seen on the internet"?


Reinforcement learning is not done with random data found on the internet; it's done with curated high-quality labeled datasets. Although there have been approaches that try to apply reinforcement learning to pre-training[1] (to learn in an unsupervised way a predict-the-next-sentence objective), as far as I know it doesn't scale.

[1] https://arxiv.org/pdf/2509.19249


You know that when A. Karpathy released NanoLLM (or whatever it was called), he said it was mainly coded by hand, as the LLMs were not helpful because "the training dataset was way off". So yeah, your argument actually "reinforces" my point.

No, your opinion is wrong because the reason some models don't seem to have a "strong opinion" on anything is not related to predicting words based on how similar they are to other sentences in the training data. It's most likely related to how the model was trained with reinforcement learning, and more specifically, to recent efforts by OpenAI to reduce hallucination rates by penalizing guessing under uncertainty[1].

[1] https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...


Well, you do understand that the "penalising", or as the ML scientific community likes to call it, "adjusting the weights downwards", is part of setting up the evaluation function for, gasp, calculating the next most likely tokens, or to be more precise, the tokens with the highest probability? You are effectively proving my point, perhaps in a slightly hand-wavy fashion that can nevertheless still be translated into technical language.

You do understand that the mechanism through which an auto-regressive transformer works (predicting one token at a time) is completely unrelated to how a model with that architecture behaves or how it's trained, right? You can have both:

- An LLM that works through completely different mechanisms, like predicting masked words, predicting the previous word, or predicting several words at a time.

- A normal traditional program, like a calculator, encoded as an autoregressive transformer that calculates its output one word at a time (compiled neural networks) [1][2]

So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.

[1] https://arxiv.org/pdf/2106.06981

[2] https://wengsyx.github.io/NC/static/paper_iclr.pdf


> So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.

Well it does - it tells me it is utterly unreliable, because it does not understand anything. It merely goes on, shitting out a nice pile of tokens that, placed one after another, kind of look like coherent sentences but make no sense, like "you should absolutely go on foot to the car wash". A completely logical culmination of Bill Gates' idiotic "Content is King" proclamation of 20 years ago.


No, you can't know that the output of a program is unreliable just from the fact that it outputs one word at a time. I already told you that you can perfectly compile a normal program, like a calculator, into the weights of an autoregressive transformer (this comes from works like RASP, ALTA, tracr, etc.). And with this I don't mean it in the sense of "approximating the output of a calculator with 99.999% accuracy", I mean it in the sense of "it deterministically gives exactly the same output as a calculator 100% of the time for all possible inputs".
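
To make the point concrete without getting into RASP/tracr details, here is a trivial, purely illustrative sketch (a plain program, nothing to do with transformers): it emits its answer strictly one token at a time, conditioned on what it has emitted so far, and is still a perfectly reliable adder. The token-at-a-time interface says nothing about reliability.

    # Trivial sketch: "autoregressive" in interface only. It emits exactly one
    # character per step, conditioned on what it has emitted so far, and is
    # still a 100%-reliable adder. The decoding loop tells you nothing about
    # whether the underlying computation is reliable.
    def next_token(prompt, emitted):
        a, b = map(int, prompt.split("+"))
        answer = str(a + b)
        return answer[len(emitted)] if len(emitted) < len(answer) else None  # None = stop

    def run(prompt):
        emitted = ""
        while (tok := next_token(prompt, emitted)) is not None:
            emitted += tok
        return emitted

    print(run("12+30"))  # "42", every time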

> No, you can't know that the output of a program is unreliable just from the fact that it outputs one words at a time

Yes I can, and it shows every time the "smart" LLMs suggest we take a walk to the car wash, or suggest 1.9 < 1.11, etc...


Did you try several times per model? In my experience it's luck of the draw. All the models I tried managed to get it wrong at least once.

The models that had access to search got it right. But then we're just dealing with an indirect version of Google.

(And they got it right for the wrong reasons... i.e. this is a known question designed to confuse LLMs.)


I guess it didn’t want to rule out the existence of ultra-powerful water jets that can wash a car in sniper mode.

I enjoyed the Deepseek response that said “If you walk there, you'll have to walk back anyway to drive the car to the wash.”

There’s a level of earnestness here that tickles my brain.


Opus 4.6 answered with "Drive." Opus 4.6 in incognito mode (or whatever they call it) answered with "Walk."

They pass it because it went viral a week ago and has been patched

I tried with Opus 4.6 Extended and it failed. LLMs are non-deterministic, so I'm guessing if I try a couple of times it might succeed.

>Only most?!

There is such a thing as "mobile car wash" where they come to you, so "most" does seem appropriate.


Right, I use it all the time.

There are car wash services that will come to where your car is and wash it. It’s not wrong!

Kind of like this: https://xkcd.com/1368/

And it is the kind of thing a (cautious) human would say.

For example, that could be my reasoning: It sounds like a stupid question, but the guy looked serious, so maybe there are some types of car washes that don't require you to bring your car. Maybe you hand over the keys and they pick up your car, wash it, and put it back in its parking spot while you are doing your groceries or something. I am going to say "most" just to be sure.

Of course, if I expected trick questions, I would have reacted accordingly, but LLMs are most likely trained to take everything at face value, as it is more useful this way. Usually, when people ask questions to LLMs they want a factual answer, not for the LLM to be witty. Furthermore, LLMs are known to hallucinate very convincingly, and hedged answers may be a way to counteract this.


> Most car washes... I read it as a slightly sarcastic answer

There are mobile car washes that come to your house.

Do they involve you walking to them first?

You could, but presumably most people call. I know of such a place. They wash cars on the premises but you could walk in and arrange to have a mobile detailing appointment later on at some other location.

That still requires a car present to be washed though.

but you can walk over to them and tell them to go wash the car that is 50 meters away. no driving involved.

opus 4.6 extended still fails.

> Only most?!

I mean, I can imagine a scenario where they have a 50 m pipe that is readily available commercially?


> Only most?!

What if AI developed sarcasm without us knowing… xD


That's the problem with sarcasm...

Sure it did.

Once I asked ChatGPT "it takes 9 months for a woman to make one baby. How long does it take 9 women to make one baby?". The response was "it takes 1 month".

I guess it gives the correct answer now. I also guess that these silly mistakes are patched and these patches compensate for the lack of a comprehensive world model.

These "trap" questions dont prove that the model is silly. They only prove that the user is a smartass. I asked the question about pregnancy only to to show a friend that his opinion that LLMs have phd level intelligence is naive and anthropomorphic. LLMs are great tools regardless of their ability to understand the physical reality. I don't expect my wrenches to solve puzzles or show emotions.


I have no idea how an LLM company can make any argument that their use of content to train the models is allowed that doesn't equally apply to the distillers using an LLM output.

"The distilled LLM isn't stealing the content from the 'parent' LLM, it is learning from the content just as a human would, surely that can't be illegal!"...


The argument is that converting static text into an LLM is sufficiently transformative to qualify for fair use, while distilling one LLM's output to create another LLM is not. Whether you buy that or not is up to you, but I think that's the fundamental difference.

The whole notion of 'distillation' at a distance is extremely iffy anyway. You're just training on LLM chat logs, but that's nowhere near enough to even loosely copy or replicate the actual model. You need the weights for that.

> The U.S. Court of Appeals for the D.C. Circuit has affirmed a district court ruling that human authorship is a bedrock requirement to register a copyright, and that an artificial intelligence system cannot be deemed the author of a work for copyright purposes

> The court’s decision in Thaler v. Perlmutter,1 on March 18, 2025, supports the position adopted by the United States Copyright Office and is the latest chapter in the long-running saga of an attempt by a computer scientist to challenge that fundamental principle.

I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

https://www.skadden.com/insights/publications/2025/03/appell...


Thaler v. Perlmutter is a weird case because Thaler explicitly disclaimed human authorship and tried to register a machine as the author.

Whereas someone trying to copyright LLM output would likely insist that there is human authorship via the choice of prompts and careful selection of the best LLM output. I am not sure if claims like that have been tested.


The US Copyright Office has published a statement that they see AI output as analogous to a human contracting the work out to a machine. The machine would hold the copyright, but it can't, so consequently there is none. Which is imho slightly surprising, since your argument about the choice of prompt and output seems analogous to the argument that led to photographs being subject to copyright despite being made by a machine.

On the other hand, in a way the opinion of the US Copyright Office doesn't matter; what matters is what the courts decide.


It's a fine line that's been drawn, but this ruling says that AI can't own a copyright itself, not that AI output is inherently ineligible for copyright protection or automatically public domain. A human can still own the output from an LLM.

> A human can still own the output from an LLM.

It specifically highlights human authorship, not ownership


>I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

If the person who prompted the AI tool to generate something isn't considered the author (and therefore doesn't deserve copyright), then does that mean they aren't liable for the output of the AI either?

Ie if the AI does something illegal, does the prompter get off scot-free?


When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models. When you get tokens from one of these providers, you sort of did.

I think it's a pretty weak distinction and by separating the concerns, having a company that collects a corpus and then "illegally" sells it for training, you can pretty much exactly reproduce the acquire-books-and-train-on-them scenario, but in the simplest case, the EULA does actually make it slightly different.

Like, if a publisher pays an author to write a book, with the contract specifically saying they're not allowed to train on that text, and then they train on it anyway, that's clearly worse than someone just buying a book and training on it, right?


> When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models.

Nice phrasing, using "pirate".

Violating the TOS of an LLM is the equivalent of pirating a book.


Contracts can't exclude things that weren't invented when the contracts were written.

Ultimately it's up to legislation to formalize rules, ideally based on principles of fairness. Is it fair, in a non-legalistic sense, for all old books to be trainable on, but not LLM outputs?


Because the terms by each provider are different

American models train on public data without a "do not use this without permission" clause.

Chinese models train on models that have a "you will not reverse engineer" clause.


> American Model trains on public data without a "do not use this without permission" clause.

this is going through various courts right now, but likely not


Not really your point, but I think the skills needed to create these things are much slower to train than chips and data centres are to produce.

So they couldn't really build any of these projects weekly since the cost of construction materials / design engineers / construction workers would inflate rapidly.

Worth keeping in mind when people say "we could have built 52 hospitals instead!" or similar. Yes, but not really... since the other constraints would quickly reveal themselves


I think this is cool!

But by some definition my "Ctrl", "C", and "V" keys can build a C compiler...

Obviously being facetious but my point being: I find it impossible to judge how impressed I should be by these model achievements since they don't show how they perform on a range of out-of-distribution tasks.


Even that is underselling it; jobs are a necessary evil that should be minimised. If we can have more stuff with fewer people needing to spend their lives providing it, why would we NOT want that?


Because we've built a system where if you don't have a job, you die.


This is already hyperbolic; in most countries where software engineers or similar knowledge workers are widely employed there are welfare programmes.

To add to that, if there is such mass unemployment in this scenario it will be because fewer people are needed to produce and therefore everything will become cheaper... This is the best kind of unemployment.

So at best: none of us have to work again and will get everything we need for free. At worst, certain professions will need a career switch which I appreciate is not ideal for those people but is a significantly weaker argument for why we should hold back new technology.


Most of those welfare programs aren't very good, and most of that is on purpose, to make people get jobs at whatever cost.


If you were to rank all of the C compilers in the world and then rank all of the welfare systems in the world, this vibe-coded mess would be at approximately the same rank as the American welfare system. Especially if you extrapolate this narcissistic, hateful kleptocracy out a few more years.


Did we build it or did nature?


We did.


Yeah, but who can be hurt by this? These are both private companies. So whose interests is he "conflicting" with? I'm sure the shareholders will raise it with him and/or bring a lawsuit if they aren't happy (they probably are happy).


What $$$?! The top tier of Apple One is £36.95/month. If I spend 15 mins extra every month self-hosting then it's immediately not worth it. (Not to mention self-hosting won't be free.)

Also, for that price I get: 2TB cloud storage, Apple TV, Apple Music, news, workouts, arcade, most of which cannot be self-hosted.

Economies of scale are real, it’s possible Apple makes a ton of money and the user is getting a good deal!


The initial tweet was primarily a lie though

> The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.

If I cloned Pixar’s rendering library, called into it, and then added ‘built a renderer from scratch’ to my CV, that would be entirely dishonest…

I use LLMs often and don’t hate Cursor or think they’re a bad company. But it’s obvious they are being squeezed and have little USP (even less so than other AI players). They are frankly extremely pressured to make up lies.

I don’t think I’d resist the pressure either, so not on a high horse here, but it doesn’t make it any less dishonest.


Interestingly, the UK PM (and allies) just blocked a would-be political rival, Andy Burnham, from standing as an MP.

One of the given reasons is because Burnham is currently mayor of Greater Manchester, and running a new election there would cost approx £4m(!!) which is a huge waste of taxpayer money.

I was surprised that they even gave this as a faux reason since it seems like the sort of money they would spend on replenishing the water coolers, or buying bic pens, or... building a static website!


Tangentially, Burnham has a long history with these sorts of public-sector private vampires, having been up to his neck in PFI (of "£200 to change a lightbulb" fame) in his stint leading the NHS.

eg.

https://www.theguardian.com/uk/2012/jun/28/labour-debt-peter...

https://doctorsforthenhs.org.uk/the-truth-about-the-lies-tha...

etc


And that's just it. Vampiric.

The fact that a huge amount of money is extracted from the UK government for no (or very little) value is a crying shame.

I know multiple people who work as consultants (hired via private agencies, paid for by Government) who have literally done nothing for six months plus.

They have no incentive to whistleblow, the agency employing them has no incentive to get rid of them as it takes a cut, and the government department hiring them is none the wiser because it has no technical knowledge or understanding of what's being carried out.

It should be the scandal of the decade.


Being cynical, I would say it's because Burnham could potentially challenge Starmer. Less cynically, Labour has a big enough majority that they can afford to lose this by-election. The headache of replacing the mayor of Manchester is not worth it.

Why can't he just do both jobs? Boris did it iirc.


If memory serves, Dan Jarvis also did it, being both MP and mayor of the South Yorkshire city region or whatever it was called at the time.

It is fairly innately political. No Prime Minister has ever polled as low as Starmer and come back from it, or so is being said in the press. Burnham might be a smart electoral move, but he's not a plaything of the Labour right, so they kept him out.


The rules are inconsistent. You can be Mayor of Sheffield and an MP at the same time but you can’t be Mayor of Greater Manchester and an MP.


That's not inconsistency in the rules, that's inconsistency in what being the mayor means. In Sheffield it means you show up wearing funny clothes every so often, in Greater Manchester it means you have a full-time job, a large budget, and actual responsibilities.

For our American brethren, it's like the difference between being the Mayor of NYC vs the Macy’s Thanksgiving Day Parade King.


It's actually the role of Police and Crime Commissioner that prevents them from being an MP simultaneously. In Greater Manchester (and London) the PCC role is combined with that of Mayor, but it isn't in most other city regions.

There's not much actual difference in the mayoral aspect of the roles - Jarvis was the Mayor of the South Yorkshire Combined Authority, not simply the mayor of Sheffield City Council.

