
I get what you're trying to say, but I don't entirely agree. Raising levels of abstraction is generally a good thing. But up until now, those have mostly been deterministic. We can be mostly confident that the compiler will generate correct machine code based on correct source code. We can be mostly confident that the magnetised needle does the right thing.

I don't think this is true for LLMs. Their output is not deterministic (up for discussion). Their weights and the sources thereof are mostly unknown to us. We cannot really be confident that an LLM will produce correct output based on correct input.



I agree with you but I want to try to define the language better.

It's not that LLMs aren't deterministic, because neither are many compilers.

It's also not that LLMs produce incorrect output, because compilers do that too, sometimes.

But when a compiler produces the wrong output, it's because either (1) there's a logic error in my code, or (2) there's a logic error in the compiler†, and I can drill down and figure out what's going on (or enlist someone to help me) to fix the problem.

Let's say I tell an LLM to write an algorithm, and it produces broken code. Why didn't my prompt work? How do I fix it? Can anyone ever actually know? And what did I learn from the experience?

---

† Or I guess there could be a hardware bug. Whatever. I'm going to blame the compiler because it needs to produce bytes that work on my silicon regardless of whether the silicon makes sense.


Compilers are deterministic


This is in general only true for either trivial toy compilers or ones which have gone to lengths to have reproducible builds. GCC for instance uses a randomised branch prediction model in some circumstances.


Ok, but my understanding is that they are mostly deterministic. And there are initiatives like Reproducible Builds (https://reproducible-builds.org) that try to move even further in that direction.


But what does "mostly" mean? You can compile the same code twice and literally get two different binaries. The bits don't match.

Sure, those collections of bits tend to do exactly the same thing when executed, but that is in some sense a subjective evaluation.

---

Szundi said in a sibling comment that I was "completely [missing] the point on purpose" by bringing up compiler determinism. I think that's fair, but it's also why I opened my post by saying "I agree [with the parent], but I want to try to define the language better." Most compilers in use today are literally not deterministic, but they are deterministic in a different sense, which is useful as a comparison point to LLMs. Well, which sense? What is the fundamental quality that makes a compiler more predictable?

I'd like to try to find the correct words, because I don't think we have them yet.


I'm not a compiler expert, not by far. But my understanding is that if you compile the same code on the same machine for the same target, you'll get the same bits. Only minor things that are sometimes introduced, like timestamps, might differ. In that sense, maybe they are not strictly deterministic. But I think it's fair to classify them as "deterministic" compared to LLMs.
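
One way to sanity-check that on your own machine is to build the same source twice and compare hashes. A minimal sketch, assuming gcc is installed and hello.c is whatever file you want to test (both names are just placeholders):

    # sketch: compile the same file twice, hash the binaries, compare
    # assumes gcc is on the PATH; "hello.c" is a placeholder source file
    import hashlib
    import subprocess

    def build_and_hash(out_name):
        subprocess.run(["gcc", "-O2", "-o", out_name, "hello.c"], check=True)
        with open(out_name, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # True on most setups; timestamps or embedded build paths can make it False
    print(build_and_hash("build1") == build_and_hash("build2"))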


Arguing about LLM determinism by analogy with compilers is not really adequate; it completely misses the point on purpose


I’d say it’s not only determinism, but also the social contract that’s missing.

When I’m calling ‘getFirstChar’ from a library, the author and I have a good understanding of what the function does based on a shared context of common solutions in the domain we’re working in.

When you ask ChatGPT to write a function that does the same, your social contract is between you and untold billions of documents that you hope the algorithm weights correctly according to your prompt (we should probably avoid programming by hope).

You could probably get around this by training on your codebase as the corpus, but until we answer all the questions about what that entails it remains, well, questionable.


> we should probably avoid programming by hope

I use Cursor at work, which is basically VSCode + LLM for code generation. It's guess-and-check, basically. Plenty of people look up StackOverflow answers to their problem, then verify that the answer does what they want. (Some people don't verify, but those people are probably not good programmers, I guess.) Well, sometimes I get the LLM to complete something, then verify that the code it completed is what I would have written (and correct it if not). This saves me time/typing in the long run even if I have to correct it at times. And I don't see anything wrong with this. I'm not programming by hope, I'm just saving time.


This increases the time you spend proofing others' work (tedious) versus the time you spend developing a solution in code (fun). Also, if the LLM output is correct 95% of the time, one tends to get sloppier with the checking, as it will feel unnecessary most of the time.


> This increases the time you spend proofing others' work (tedious) versus the time you spend developing a solution in code (fun).

I find that I don't use it as much for generating code as I do for automating tedious operations. For example, moving a bunch of repeated code into a function, then converting the repeating blocks into function calls. The LLM's really good at doing that quickly without requiring me to perform dozens of copy-paste operations, or a bunch of multi-cursor-fu.
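
To make that concrete, here's an invented before/after of the kind of repetitive edit I mean (the file names and logic are hypothetical):

    # before: the same block repeated for each file (shown here as comments)
    #   with open("sales.csv") as f:
    #       rows = [line.split(",") for line in f]
    #   print("sales", sum(float(r[2]) for r in rows))
    #   ...and again, nearly verbatim, for refunds.csv and fees.csv...

    # after: one helper plus three calls
    def report_total(path, label):
        with open(path) as f:
            rows = [line.split(",") for line in f]
        print(label, sum(float(r[2]) for r in rows))

    for path, label in [("sales.csv", "sales"),
                        ("refunds.csv", "refunds"),
                        ("fees.csv", "fees")]:
        report_total(path, label)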

Also, I don't use it to generate large blocks of code or complicated logic.


Just what I was thinking about lately: what if LLMs are not 95% precise, but 99.95%? After like 50-100 checks you find nothing, so you just dump the whole project on it to be implemented - and there come the bugs.
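
Back-of-the-envelope numbers for that (purely illustrative): at 99.95% per-item accuracy a 100-item spot check usually finds nothing, yet a big project still accumulates defects.

    # rough arithmetic for the 99.95% scenario
    p_ok = 0.9995
    print(p_ok ** 100)          # ~0.95: chance a 100-item spot check finds no error
    print(10_000 * (1 - p_ok))  # ~5 expected defects across a 10,000-item project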

However ... your colleagues just do the same.

We'll see how this unfolds. For now the industry seems a bit stuck at this level. Big models are too expensive to train for marginal gains, and smaller ones are getting better but that doesn't change this. Until some new idea comes along for how LLMs should work, we won't see the 99.95% anyway.


One idea is obvious: a multi-model approach. It's partially done today for safety checks, and the same can be done for correctness. One model produces a result, a different model only checks its correctness. Optionally generate several results, and have the second model check correctness and select the best. This is more expensive, but should give better final output. Not sure, this may have already been done.
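
A minimal sketch of that generate-then-check loop, with generate() and check() as hypothetical stand-ins for calls to two different models (not any real API):

    # generate-then-verify sketch; generate() and check() are placeholders
    def generate(prompt):
        raise NotImplementedError  # call the "writer" model here

    def check(prompt, candidate):
        raise NotImplementedError  # ask the "checker" model for a correctness score

    def best_of(prompt, n=3):
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: check(prompt, c))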


Yeah, I’m more worried about the middle ground that would make software quality (even) worse than it is today.


> We can be mostly confident that the compiler will generate correct machine code based on correct source code.

I recently got an email about gcc 14.2; they fixed some bugs in it. Can we trust it now? Maybe those were the last bugs. But before that, it was probably a bad idea to trust it. No, even a compiler's output requires extensive testing. Usually that testing is done all at once, on the final result of coding and compilation.

> Their output is not deterministic

yes.

> Their weights and the sources thereof are mostly unknown to us

Some of them are known. Does that make you feel better? There are too many weights for you to be able to track its 'thinking' anyway. There are some tools which sort of show something, but they still don't help much.

> We cannot really be confident that an LLM will produce correct output based on correct input

No, we can't. But it's so useful when it works. I'm using it regularly for small utilities and fun pictures. Even though it can give outright wrong answers for relatively simple math questions. With explanations and full confidence.


For the average programmer, the infinite layers of abstraction, libraries, and middleware aren't deterministic either. The fact that LLMs are, honest to god, probabilistic estimators doesn't change anything about what they produce or how they see their own stuff.


> We cannot really be confident that an LLM will produce correct output based on correct input.

There are two things at play here: one is the LLM with a human in the loop, in which it's just a tool for programmers to do the same thing they have been doing, and the other is the LLM as a black-box automaton. For the former, it's not a problem that the tool is nondeterministic; we double-check the results and add our manual labour anyway. The fact that a tool can fail sometimes is an unsurprising fact of engineering.

I think the criticism in this chain of comments applies more to the latter, but even that still has value to non-tech people, just like no-code approaches do, however shitty it looks to us software engineers.


I don't know. Programming with an LLM turns every line of code into legacy code that you have to maintain and debug and don't fully grok because you didn't write it yourself.


If it's in your PR then you wrote it; no one should be approving code they do not understand, whether it came from AI or googling. Nothing changes there.


What if we can't do that kind of abstraction anymore? I mean we certainly can, but we will lose the ability to configure the tiny details of the system.

So the other path forward could very well be LLMs, as they can save a lot of time with writing boilerplate code.


So what if the output is stochastic? LLMs have self-consistency, so you can repeat the inference several times and pick the most frequent output.
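
A minimal sketch of that self-consistency trick, with sample() as a placeholder for whatever stochastic model call you're making:

    # self-consistency sketch: sample several times, keep the most frequent answer
    from collections import Counter

    def sample(prompt):
        raise NotImplementedError  # one stochastic model call goes here

    def most_frequent_answer(prompt, n=5):
        answers = [sample(prompt) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]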


Most frequent output does not imply correctness; LLMs are often confidently wrong.

They can't even perform basic arithmetic (which is not surprising, since they operate at the syntactic level, oblivious to any semantic rules), yet people seem to think offloading more complex tasks with strict correctness requirements to them is a good idea. Boggles the mind, tbh.


And you can pay in time and/or $$ for the privilege of having to do this extra unnecessary work.



