On the far end of this debate you end up with types like _RelationshipJoinConditionArgument which I'd argue is almost more useless than no typing at all. Some people claim it makes their IDE work better, but I don't use an IDE and I don't like the idea of doing extra work to make the tool happy. The opposite should be true.
Literally dealing with this right now. My wife got what appears to be a (very expensive) counterfeit item that is technically non-returnable (not lying down without a fight). Kind of cathartic to see this pop up.
I don’t think the person quoted is implying that it should be that way, merely pointing out a discovery that builders have made: they _can_ get a symbolic bonus. One can skip building to code… do a quick and bad job, save cost, and move on to the next paying job more quickly. That “bonus” doesn’t exist if you build to code (and of course it shouldn’t exist, but neither should the bonus that does exist; your stick should prevent it).
I have a similar experience. I was a devoted PhD student working long hours taking on a lot of responsibility. It burned me out, hurting my productivity. I have mixed feelings about it; I love the friends I made and the things I learned, but I don’t think I should have had to suffer what I suffered. Simultaneously I’m somewhat glad I experienced it then, because now I work in tech and I’ll _never_ work outside of business hours (I’ll hack on personal projects I consider fun if I feel like it). And I’m more productive than my colleagues that do. There’s something mysterious about the contemporary PhD, not all good and not all bad.
The organization and formatting of the single .tex file is such that one could almost read the source alone. Really nice. Also, I had no idea that GitHub did such a good job rendering the LaTeX math in markdown, it's imperfect but definitely good.
Been waiting to see what Astral would do first (with regards to product). Seems like a mix of Artifactory and conda: Artifactory in providing a package server, and conda in trying to fix the difficulty that comes with Python packages that have compiled components or dependencies. Wheels mostly solved that, but PyTorch wheels requiring a specific CUDA version can still be a mess that conda fixes.
Given Astral's heavy involvement in the wheelnext project I suspect this index is an early adopter of Wheel Variants which are an attempt to solve the problems of CUDA (and that entire class of problems not just CUDA specifically) in a more automated way than even conda: https://wheelnext.dev/proposals/pepxxx_wheel_variant_support...
I really like Berkeley Mono and I don’t regret my old purchase, but my Emacs and Terminal configs have been rocking Pragmata Pro for a while now. Looking at the version 2 release notes, it appears that Berkeley Mono has some new condensed widths (I think that feature is what has kept me on Pragmata Pro). Will have to take it for a spin.
OpenMP is great. I’ve done something similar to your second case (thread-local objects that are filled in parallel and later combined). In the “OpenMP off” case (pragmas ignored), is it possible to avoid the overhead of the thread-local object essentially getting copied into the final object (since no OpenMP means only a single thread-local object)? I avoided this by implementing a separate code path, but I’m just wondering if there are any tricks I missed that would still allow a single code path.
Give one of the threads (thread ID 0, for instance) special privileges. Its list is the one everything else is appended to, then there's only concatenation or copying if you have more than one thread.
Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.
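To illustrate the pattern outside of OpenMP, here's roughly the same idea sketched in Python with a thread pool standing in for the OpenMP threads (the per-chunk work function and the sizes are made-up stand-ins):

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def fill_chunk(out, start, stop):
        # Each worker writes only to its own slice of the pre-allocated
        # array, so there is no combine step and no locking needed.
        out[start:stop] = np.arange(start, stop) ** 2  # stand-in for real work

    n = 1_000_000
    num_threads = 4
    result = np.empty(n, dtype=np.int64)  # pre-allocated final collection

    bounds = np.linspace(0, n, num_threads + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(fill_chunk, result, start, stop)
            for start, stop in zip(bounds[:-1], bounds[1:])
        ]
        for f in futures:
            f.result()  # surface any exceptions from the workers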
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program using a function called transmit." None of them ever get it right :)
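For reference, the shape of answer I'm looking for is roughly the following minimal sketch (transmit here is assumed to be a plain callable you hand in, which is what actually ships the JSON string to the external program):

    import json

    class CaptureMiddleware:
        """Pure ASGI middleware: records the request body, response headers,
        and response body in a dict, then passes the JSON-encoded record to
        transmit()."""

        def __init__(self, app, transmit):
            self.app = app
            self.transmit = transmit  # assumed: callable taking a JSON string

        async def __call__(self, scope, receive, send):
            if scope["type"] != "http":
                await self.app(scope, receive, send)
                return

            record = {"request_body": b"", "response_headers": [], "response_body": b""}

            async def receive_wrapper():
                # Capture each request-body chunk while still passing it through.
                message = await receive()
                if message["type"] == "http.request":
                    record["request_body"] += message.get("body", b"")
                return message

            async def send_wrapper(message):
                # Capture headers and body chunks on the way out.
                if message["type"] == "http.response.start":
                    record["response_headers"] = [
                        (k.decode("latin-1"), v.decode("latin-1"))
                        for k, v in message.get("headers", [])
                    ]
                elif message["type"] == "http.response.body":
                    record["response_body"] += message.get("body", b"")
                await send(message)

            try:
                await self.app(scope, receive_wrapper, send_wrapper)
            finally:
                self.transmit(json.dumps({
                    "request_body": record["request_body"].decode("utf-8", "replace"),
                    "response_headers": record["response_headers"],
                    "response_body": record["response_body"].decode("utf-8", "replace"),
                }))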
I normally ask about building a multi-tenant system using async SQLAlchemy 2 ORM where some tables are shared between tenants in a global PostgreSQL schema and some are in a per-tenant schema.
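For context, the answer I'm fishing for centers on SQLAlchemy's schema_translate_map execution option. A minimal sketch of the pattern (the models, schema names, and connection string are all made up):

    from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

    class Base(DeclarativeBase):
        pass

    class Plan(Base):
        """Shared table: lives in one global schema for all tenants."""
        __tablename__ = "plans"
        __table_args__ = {"schema": "shared"}
        id: Mapped[int] = mapped_column(primary_key=True)
        name: Mapped[str]

    class Order(Base):
        """Per-tenant table: 'tenant' is a placeholder schema that gets
        rewritten to the real tenant schema at execution time."""
        __tablename__ = "orders"
        __table_args__ = {"schema": "tenant"}
        id: Mapped[int] = mapped_column(primary_key=True)
        plan_id: Mapped[int]

    engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/app")

    def session_for_tenant(tenant_schema: str) -> AsyncSession:
        # schema_translate_map rewrites only the "tenant" placeholder; the
        # "shared" schema passes through untouched. It does not create the
        # schemas or tables -- those must already exist for each tenant.
        return AsyncSession(
            engine.execution_options(schema_translate_map={"tenant": tenant_schema})
        )

    # Usage:
    #   async with session_for_tenant("tenant_acme") as session:
    #       plans = (await session.execute(select(Plan))).scalars().all()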
Nothing gets it right first time, but when ChatGPT 4 first came out, I could talk to it more and it would eventually get it right. Not long after that though, ChatGPT degraded. It would get it wrong on the first try, but with every subsequent follow up it would forget one of the constraints. Then when it was prompted to fix that one, it forgot a different one. And eventually it would cycle through all of the constraints, getting at least one wrong each time.
Since then, benchmarks came out showing that ChatGPT “didn’t really degrade”, but all of the benchmarks seemed focused on single question/answer pairs and not actual multi-turn chat. For this kind of thing, ChatGPT 4 has never, in my experience, recovered to being as good as it was when it was first released.
It’s been months since I’ve had to deal with that kind of code, so I might be forgetting something, but I just tried it with Codestral and it spat out something that looked reasonable very quickly on its first try.
>It would get it wrong on the first try, but with every subsequent follow up it would forget one of the constraints. Then when it was prompted to fix that one, it forgot a different one. And eventually it would cycle through all of the constraints, getting at least one wrong each time.
That drives me nuts and makes me ragequit about half the time. Although it's usually more effective to go back and correct your initial prompt rather than prompt it again.
I had a similar experience. I was trying to get GPT 4 to write some R/Stan code for a bit of Bayesian modelling. It would get the model wrong, and then I would walk it through how to do it right, and by the end it would almost get it right. But on the next step it would go, oh, this is what you want, and the output would be identical to the first wrong attempt, which would start the loop over again.
Similar experience using GPT-4 for help with Apple's Accessibility API. I wanted to do some non-happy-path things and it kept looping between solutions that failed to satisfy at least one of a handful of requirements I had, in ways that meant I couldn't combine the different "solutions" to meet all the requirements.
I was eventually able to figure it out with the help of some early 2010s blog posts. Sadly I didn't test giving it that context and having it attempt to find a solution again (and this was before web browsing was integrated with the web app).
A bigger issue than it not knowing enough to fulfill my request (it was pretty obscure, so I didn't necessarily expect that it would be able to) was that it didn't mind emitting solutions that failed to meet the requirements. "I don't know how to do that" would've been a much preferred answer.
This seems like an important failure mode to me. I too have noticed GPT-4 looping between a few different failure cases; in my case it was state transitions in JS code. Explaining to it what it did wrong didn't help.
Give an LLM all the time you want, and it will still not get it right. In fact, it will most likely give worse and worse answers with time. That’s a big difference from a software developer.
My experience is very different. Often it (ChatGPT or Copilot, depending on what I'm trying to accomplish) gets things right the first time. When it doesn't, it's usually close enough that a bit of manual modification is all that's needed. Sometimes it's totally wrong, but I can usually point it in the right direction.
I mean, with a nonzero temperature, the randomness will eventually produce every combination of tokens in the corpus, so with a sufficiently large "all the time you want" you can produce limitless correct answers.
I love to ask it to "make me a Node.js library that pings an ipv4 address, but you must use ZERO dependencies, you must use only the native Node.js API modules"
The majority of models (both proprietary and open-weight) don't understand:
- by inference, ping means we're talking about ICMP
- ICMP requires raw sockets
- Node.js has no native raw socket API
You can do some CoT trickery to help it reason about the problem and maybe finally get it settled on a variety of solutions (usually some flavor of building a native add-on using C/C++/Rust/Go), or just guide it there step by step yourself, but the back and forth to get there requires a ton of pre-knowledge of the problem space which sorta defeats the purpose. If you just feed it the errors you get verbatim trying to run the code it generates, you end up in painful feedback loops.
(Note: I never expect the models to get this right; it's just a good microcosmic but concrete example of where knowledge & reasoning meets actual programming acumen, so it's cool to see how models evolve to get better, if at all, at the task).
This is the same level of gotcha that everyone complains about when interviewing. It mainly depends on the interviewee having the same assumptions (pings definitely do not have to be ICMP) and the same, usually bespoke, knowledge base (Node.js peculiarities). I can see that an LLM should know whether raw sockets are available, but that's not what you asked.
In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.
> In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.
Correct, I did. This is a direct indictment of a given model's ability to plan/reason in this particular context. There are plenty of situations where models will respond with "Sorry, that's not possible". Ask GPT-4 "Tell me how to grow biological wings on a human" and it will respond with something along the lines of "this isn't currently possible, but here's a theoretical exploration of the idea".
GPT-4 gets very close to the Node.js question on its own, with a response breakdown similar to the one above, provided the prompt is clear and detailed enough. But I test the open-weight models in the same way to see if they have the capacity to exhibit similar reasoning or a chain-of-thought process on their own. They usually don't without excessive prompt engineering or few-shot examples.
I said that I don't expect models to get this right not because I don't _want_ them to, but because I think it's an important milestone when they do. Autoregressive token prediction is unlikely to produce the real outcome I'm testing for here, but if it ever does, that's an interesting finding.
I usually throw in some complex Rust code with lifetime requirements and ask them to fix it.
LLMs aren't capable of providing much help with that in general, other than in some very basic cases.
The best way to get your work done is still to look into Rust forums.
It works amazingly well for those who have never coded in Rust, at least in my experience. It took me a couple of hours and 120 lines of code to set up a WebRTC signaling server.
Damn, show us your brilliant prompt then. LLMs cannot do this, not even in Python, which has libraries like Blacksheep that honestly make it a trivial task.
My point is that you shouldn't expect to one shot everything. Have it start by writing a spec, then outline classes and methods, then write the code, and feed it debug stuff.
Well sure, but that wasn't what we were discussing. The original comment says they use that as their benchmark. While their coding task is a bit complex compared to other benchmarking prompts, it's not that crazy. Here is an example of prompts used for benchmarking with Python for reference:
At the end of the day LLMs in their current iteration aren't intended to do even moderately difficult tasks on their own but it's fun to query them to see progress when new claims are made.
Exactly, expecting one-shot, 100% working code from a single prompt is ridiculous at this point. It's why libraries like Aider are so useful, because you can iteratively diff generated code until it's usable.
Sure, it's impossible at this point, but the point of a benchmark isn't to complete the task; it's to test efficacy overall and to see progress. None of them hit 100% on even the simplistic Python benchmarks, but that doesn't mean we shouldn't measure that capability. But sure, I get it. That's not how they are intended to be used, but that's also not the point the commenter was laying out.
Prompts like yours (I ask them for a fluid dynamics simulator which also doesn't succeed) inform us of the level they have reached. A useful benchmark, given how many of the formal ones they breeze through.
I'm glad they can't quite manage this yet. Means I still have a job.
No, he is right; he is taking it to the extreme to make the point: the more specific you have to make your prompt, the more you are actually contributing to the result yourself and the less the model is.
Yes, but the build-up isn't manual. You keep patching the prompt with responses until you reach the final result. The last prompt will be almost the whole completed code, obviously.
Well, now we get into information density and Kolmogorov complexity. The more complicated your desired output program is, the more information you'll have to put in, i.e., more complicated prompts.
It's something I know how to do after figuring it out myself and discovering the potential sharp edges, so I've made it into a fun game to test the models. I'd argue that it's a great prompt (to keep using consistently over time) to see the evolution of this wildly accelerating field.
Was it more about the targets the xz actor was interested in than some security property inherent to OpenBSD that would prevent that sort of dynamic linking vulnerability?
Debian and Red Hat link liblzma into sshd for systemd support, which OpenBSD doesn't use. So in the sense that those distros have a larger attack surface, I guess you can consider OpenBSD more secure, but it's not just OpenBSD; there are plenty of Linux distros that don't do this either.