Mistral releases Devstral2 and Mistral Vibe CLI

simonw · 2025-12-09T16:45:01 1765298701

  llm install llm-mistral
  llm mistral refresh
  llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"

https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!

(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)

Jimmc414 · 2025-12-09T18:57:36 1765306656

We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.

simonw · 2025-12-09T19:19:37 1765307977

I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

armcat · 2025-12-10T12:17:32 1765369052

Hi Simon! Love your work! Our of curiosity - how many pelican-cycling samples do you produce. Curious about the variance here. Thanks!

simonw · 2025-12-10T13:09:43 1765372183

I've lost count, but there are 85 posts with that tag here: https://simonwillison.net/tags/pelican-riding-a-bicycle/

I need to extract them all into a formal collection.

karambir · 2025-12-10T13:40:20 1765374020

I think the parent poster might be asking about generations per model-test. Atleast that's what I understood.

huxley · 2025-12-10T16:18:17 1765383497

A coffee-table book? A Natural History of SVG Pelicans

jgalt212 · 2025-12-10T13:10:58 1765372258

Aiden is perhaps misinformed. From a Bing search performed just now.

> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:

100721 · 2025-12-10T15:10:46 1765379446

Web search-based RAG is very different from having something embedded in a model's training data, though.

jgalt212 · 2025-12-10T18:59:17 1765393157

ChatGPT website gives a similar answer. Are they running RAG, or the model?

> Yes — I’m familiar with the “pelican riding a bicycle” SVG generation test.

> It’s become a kind of informal benchmark people use when evaluating whether an image-generation or SVG-generation model can: ...

th0ma5 · 2025-12-09T19:27:40 1765308460

[flagged]

vanschelven · 2025-12-09T20:40:30 1765312830

Whatever you think Jimmc414's _concerns_ are (they merely state a possibility) Simon enumerates a number of concerns in the linked article, and then addresses those. So I'm not sure why you think this is so.

vnvnff · 2025-12-10T10:21:47 1765362107

It's a pattern: https://news.ycombinator.com/item?id=44725190

dugidugout · 2025-12-09T19:39:17 1765309157

Condescending and disrespectful to whom? Everybody wholsale? This doesnt seem reasonable? Please elaborate.

bravetraveler · 2025-12-09T20:52:02 1765313522

Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.

It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y to not 'continue the conversation'. A summary in the very least, how exactly it pertains. Sell us.

In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"

Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.

simonw · 2025-12-09T22:36:55 1765319815

I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.

bravetraveler · 2025-12-09T22:46:38 1765320398

Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?

You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.

Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.

Barbing · 2025-12-10T00:51:20 1765327880

Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).

I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.

Simon has never seemed annoying so unlike other comments that might worry me (even “Opus made this” even though it’s cool but I’m concerned someone astroturfed), that comment would’ve never raised my eyebrows. He’s also dedicated and I love he devotes his time to a new field like this where it’s great to have attempts at benchmarks, folks cutting through chaff, etc.

bravetraveler · 2025-12-10T00:54:15 1765328055

The specific 'question' is a promise to catch training on more publicly available data, and to expect more blog links copied 'into dozens of different conversations'... Jump for joy. Stop the presses. Oops, snarky again :)

Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.

Aside: all this concern about over-fitting just reinforces my belief these things won't take the profession any time soon. Maybe the job.

simonw · 2025-12-09T23:24:26 1765322666

You don't have to convince me the pelican riding a bicycle SVG benchmark is asinine. That's kind of the point!

bravetraveler · 2025-12-09T23:25:35 1765322735

Having read the followup post being linked, I'm even more confused. Commenting or, really, anything seems even less worthwhile. That's my point.

You bring the benchmark and anticipated their... cheesing, with a promise to catch them on it. Cool announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there. Like everything else anyone wrote.

Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)

Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again

charcircuit · 2025-12-09T23:47:34 1765324054

If you want a summary you can have your ai assistant summarize the link.

bravetraveler · 2025-12-09T23:49:41 1765324181

Woooooosh, please see if an LLM can help you. I'm not getting paid for this

tomrod · 2025-12-09T21:33:50 1765316030

Hell, I would consider myself graced that simonw, yes, THAT simonw, the LLM whisperer, took time out of his busy schedule to send me to a discussion I might have expressed interest in.

bravetraveler · 2025-12-09T21:45:46 1765316746

> send me to a discussion I might have expressed interest in

No, no, remember? Points to the blog you were already reading! Working diligently to build a brand: podcast, paid newsletter, the works.

tomrod · 2025-12-10T04:12:40 1765339960

I wasn't speaking to this interaction, and my point is genuine. Simonw has done fantastic work in the LLM space

bravetraveler · 2025-12-10T12:45:43 1765370743

... and my point remains: he's fine. Could be better. If he does grace us, he can choose to bait the hook more effectively. Or not. The stakes are silly-low.

This interaction is, effectively, a link dropped with an announcement of an announcement. For what has already occurred. Over-fitting, training? You don't say.

If I wanted to be more of an ass, I'd look to argue about hype generation. But I don't, I appreciate any honest effort, which I believe for Simon.

renewiltord · 2025-12-10T12:11:46 1765368706

It is SEO-y and I’m sure no small impulse is to drive traffic to his website since he’s primarily an AI influencer.

However, there are always people who are “native” to a platform and field. Pieter Levels is native to Twitter and the nomad community. Swyx is native to Twitter/HN and the devtools community. And simonw is native to at least HN and the LLM-interest community. And various streamers and onlyfans creators do the same with theirs.

Through some degree of releasing things that whatever that community values they build a relationship that allows them greater freedom in participating there. It does create a positive feedback cycle for them (and hopefully the community) that most of them will try to parlay into something else: Levels and the OnlyFans creators are probably best at this monetization of reputation but each of them is doing this. One success step for simonw would be “Creator of Pelican LLM benchmark”.

Once you’ve breached some stable point in the community the norms are somewhat relaxed. But it’s not easy to do that. You have to produce some extraordinary volume of things that people value.

I think, tbh, tptacek here could most effectively monetize if he decided to. But he doesn’t appear to want to so he’s just a participant not an influencer so to speak. Whereas someone like Levels or simonw is both.

It’s just creator economy stuff. Meta discussions like this always pop up. But ultimately simonw is past the threshold of trust. There are people who say “wtf? Why is levels making $50k/mo on a stupid vibe-coded flying game?”

It ain’t the game. It’s the following before the game. The resource is the audience.

bravetraveler · 2025-12-10T12:26:34 1765369594

Thanks for posting, I agree. I regret this being taken so pointedly at Simon, just a player in the game.

The best guy spinning the sign puts some effort in, or more crass, the best strippers make you believe.

dugidugout · 2025-12-10T18:02:20 1765389740

Well put and thank you for adding much depth here!

th0ma5 · 2025-12-09T19:42:23 1765309343

No, when did I say that?

dugidugout · 2025-12-09T19:56:51 1765310211

It isn't clear what you said.

You asserted a pattern of conduct on the user simonw:

> I think constantly replying to everybody with some link which doesn't address their concerns

Then claimed that conduct was:

> condescending and disrespectful.

I am asking you to elaborate to whom simonw is condescending and disrespecting. I don't see how it follows.

Workaccount2 · 2025-12-10T00:28:31 1765326511

It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?

majormajor · 2025-12-10T01:29:15 1765330155

That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.

0cf8612b2e1e · 2025-12-10T00:26:32 1765326392

I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

theshrike79 · 2025-12-10T11:42:49 1765366969

The easiest way to fix these is give the model an environment to run code.

Any model can easily one-shot a python script that can count the occurrence of any letter anywhere and return the result.

It's just a tooling issue. You really can't "train" an LLM to do it because tokenisation and ... stuff.

0cf8612b2e1e · 2025-12-10T16:38:30 1765384710

I am not convinced they are executing code. Otherwise I would expect LLMs to not frequently guess the result of math questions.

Of course you could train it. Some quick scripting to find all words with repeat letters, build up sample sentences (aardvark has three a,) and you have hard coded the answer to simple questions that make your LLM look stupid.

thatwasunusual · 2025-12-09T23:51:41 1765324301

> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

simonw · 2025-12-10T00:15:45 1765325745

The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.

Honestly though, the benchmark was originally meant to be a stupid joke.

I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.

If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!

If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

So ever since then I've continue to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing and it appears to be a useful signal for which models are worth diving into in more detail.

thatwasunusual · 2025-12-10T02:19:32 1765333172

> If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things.

Why?

If I hired a worker that was really good at drawing pelicans riding a bike, it wouldn't tell me anything about his/her other qualities?!

suspended_state · 2025-12-10T07:22:03 1765351323

Your comment is funny, but please note: it's not drawing a pelican riding a bike, it's describing in SVG a pelican riding a bike. Your candidate would at least displays some knowledge of the SVG specs.

simonw · 2025-12-10T03:10:16 1765336216

I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on pelican riding a bicycle is a good indicator of performance on other tasks.

vikramkr · 2025-12-10T04:43:30 1765341810

The difference is that the worker you hire would be a human being and not a large matrix multiplication that had parameters optimized by a a gradient descent process and embeds concepts in a higher dimensional vector space that results in all sorts of weird things like subliminal learning (https://alignment.anthropic.com/2025/subliminal-learning/).

It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?

Also more directly the "all sorts of other things" we want llms to be good at often involve writing code/spatial reasoning/world understanding which creating an svg of a pelican riding a bicycle very very directly evaluates so it's not even that surprising?

falcor84 · 2025-12-10T10:22:21 1765362141

For better or worse, a lot of job interviews actually do use contrived questions like this, such as the infamous "how many golf balls can you fit in a 747?"

theshrike79 · 2025-12-10T11:48:57 1765367337

What if the employee can draw a bike and a pelican, but not a pelican on a bike?

jtbaker · 2025-12-10T02:53:30 1765335210

a posteriori knowledge. the pelican isn't the point, it's just amusing. the point is that Simon has seen a correlation between this skill and and the model's general capabilities.

theshrike79 · 2025-12-10T11:48:06 1765367286

It's just a variant of the wine glass - something that doesn't exist in the source material as-is. I have a few of my own I don't share publicly.

Basically in my niche I _know_ there are no original pictures of specific situations and my prompts test whether the LLM is "creative" enough to combine multiple sources into one that matches my prompt.

I think of if like this: there are three things I want in the picture (more actually, but for the example assume 3). All three are really far from each other in relevance, in the very corner of an equilateral triangle (in the vector space of the LLM's "brain"). What I'm asking it to do is in the middle of all three things.

Every model so far tends to veer towards one or two of the points more than others because it can't figure out how to combine them all into one properly.

wisty · 2025-12-10T00:18:55 1765325935

It's not nessessarily the best benchmark, it's a popular one, probably because it's funny.

Yes it's like the wine glass thing.

Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the penguin reach the peddles? How?

I can imagine a really good AI finding a funny or creative or realistic way for the penguin to reach the peddles.

An slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.

An OK AI will draw a penguin on top of a bicycle and just call it a day.

It's not as binary as the wine glass example.

thatwasunusual · 2025-12-10T02:16:36 1765332996

> It's not nessessarily the best benchmark, it's a popular one, probably because it's funny.

> Yes it's like the wine glass thing.

No, it's not!

That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) in regards to something that is realistic and something that is unrealistic?

I just don't get it.

Fnoord · 2025-12-10T05:50:48 1765345848

> the wine glass scenario is a _realistic_ scenario

It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a glass of wine as such.

A pelican riding a bike, on the other hand, is realistic in a scenario because of TV for children. Example from 1950's animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...

mzl · 2025-12-10T12:45:44 1765370744

A better reason why wine glasses are not filled like that is that wine glasses are designed to capture the aroma of the wine.

Since people look at a glass of wine and judge how much "value" they got based partly on how much wine it looks like, many bars and restaurants choose bad wine-glasses (for the purpose of enjoying wine) that are smalle and thus can be fulled more.

vikramkr · 2025-12-10T04:44:56 1765341896

If the thing we're measuring is a the ability to write code, visually reason, and handle extrapolating to out of sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?

th0ma5 · 2025-12-09T19:32:01 1765308721

If this had any substance then it could be criticized, which is what they're trying to avoid.

Etheryte · 2025-12-09T21:58:01 1765317481

How? There's no way for you to verify if they put synthetic data for that into the dataset or not.

lacoolj · 2025-12-10T18:13:28 1765390408

How did you run a 123B model locally? Or did you do this on a GPU host somewhere? If so, what spec was it?

baq · 2025-12-09T18:09:50 1765303790

but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html

aschobel · 2025-12-09T19:22:01 1765308121

in case folks are missing the context

https://news.ycombinator.com/item?id=46183294

lagniappe · 2025-12-09T18:53:59 1765306439

That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.

tarsinge · 2025-12-09T19:01:04 1765306864

In what year was it meaningful to have pelicans riding bicycles?

lagniappe · 2025-12-09T19:08:01 1765307281

SVG is a current standard. Do not be coy just to satisfy your urge to disagree.

tarsinge · 2025-12-09T19:40:51 1765309251

The website is live and renders correctly on my Safari mobile: https://www.spacejam.com/1996/

I may have missed something but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS, there is no technical limitations. So yes I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is easy an easy task for a human developer.

locallost · 2025-12-09T19:15:28 1765307728

The point stands. Whether or not the standard is current has no relevance for the ability of the "AI" to produce the requested content. Either it can or can't.

lagniappe · 2025-12-09T19:17:26 1765307846

https://news.ycombinator.com/item?id=46183673

locallost · 2025-12-10T05:32:10 1765344730

> Ergo, models for the most part will only have a cursory knowledge of a spec that your browser will never be able to parse because that isn't the spec that won.

Browsers are able to parse a webpage from 1996. I don't know what the argument in the linked comment is about, but in this one, we discuss the relevance of creating a 1996 page vs a pelican on a a bicycle in SVG.

Here is Gemini when asked how to build a webpage from 1996. Seems pretty correct. In general I dislike grand statements that are difficult to back up. In your case, if models have only a cursory knowledge of something (what does this mean in the context of LLMs anyway), what exactly they were trained on etc.

The shortened Gemini answer, the detailed version you can ask for yourself:

Layout via Tables: Without modern CSS, layouts were created using complex, nested HTML tables and invisible "spacer GIFs" to control white space.

Framesets: Windows were often split into independent sections (like a static sidebar and a scrolling content window) using Frames.

Inline Styling: Formatting was not centralized; fonts and colors were hard-coded individually on every element using the <font> tag.

Low-Bandwidth Design: Visuals relied on tiny tiled background images, animated GIFs, and the limited "Web Safe" color palette.

CGI & Java: Backend processing was handled by Perl/CGI scripts, while advanced interactivity used slow-loading Java Applets.

utopiah · 2025-12-09T20:50:08 1765313408

> neither do our web standards

I'd be curious about that actually, feel like W3C specifications (I don't mean browser support of them) rarely deprecate and precisely try to keep the Web running.

baq · 2025-12-09T19:10:49 1765307449

Yes, now please prepare an email template which renders fine in outlook using modern web standards. Write it up if you succeed, front page of HN guaranteed!

tomashubelbauer · 2025-12-09T19:04:18 1765307058

The parent comment is a reference to a different story that was on the HN home page yesterday where someone attempted that with Claude.

lagniappe · 2025-12-09T19:07:24 1765307244

Yes, and I had a lengthier response in that thread explaining why this isn't a useful metric.

https://news.ycombinator.com/item?id=46183673

MLgulabio · 2025-12-10T09:30:21 1765359021

It was a joke reference...

willahmad · 2025-12-09T17:20:46 1765300846

I think this benchmark could be slightly misleading to assess coding model. But still very good result.

Yes, SVG is code, but not in a sense of executable with verifiable inputs and outputs.

jstummbillig · 2025-12-09T18:35:10 1765305310

I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.

andrepd · 2025-12-09T21:10:06 1765314606

It's not even halfway up the list of inane things of the AI hype cycle.

hdjrudni · 2025-12-10T02:56:13 1765335373

But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.

iberator · 2025-12-09T19:37:47 1765309067

Where did you get llm tool from?!

fauigerzigerk · 2025-12-09T19:45:43 1765309543

He made it: https://github.com/simonw/llm

techsystems · 2025-12-09T22:49:46 1765320586

Cool! I can't find it on the read me, but can it run Qwen locally?

simonw · 2025-12-09T23:04:55 1765321495

The best way to do that at the moment is using the llm-ollama plugin.

samgutentag · 2025-12-10T17:44:36 1765388676

"Generate an SVG of a pelican riding a bicycle" is the new "but can it run Crysis"

breedmesmn · 2025-12-09T18:48:46 1765306126

Impressive! I'm really excited to leverage this in my gooning sessions!

cpursley · 2025-12-09T16:52:29 1765299149

Skipped the bicycle entirely and upgraded to a sweet motorcycle :)

aorth · 2025-12-09T16:57:19 1765299439

Looks like a Cybertruck actually!

BudaDude · 2025-12-09T17:58:34 1765303114

I was thinking a Warthog

https://www.halopedia.org/Warthog

lubujackson · 2025-12-09T20:30:57 1765312257

The Batman motorcycle!

troyvit · 2025-12-09T21:24:08 1765315448

I'm Pelicanman </raspy voice>

taneq · 2025-12-10T07:48:24 1765352904

The Dark Noot.

felixg3 · 2025-12-09T17:33:17 1765301597

Is it really an svg if it’s just embedded base64 of a jpg

joombaga · 2025-12-09T19:07:06 1765307226

You were seeing the base64 image tag output at the bottom. The SVG input is at the top.

esafak · 2025-12-09T16:27:45 1765297665

Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)

kevin061 · 2025-12-09T18:35:40 1765305340

The OpenAI thing is named Garlic.

(Surely they won't release it like that, right..?)

esafak · 2025-12-09T19:18:34 1765307914

TIL: https://garlicmodel.com/

That looks like the next flagship rather than the fast distillation, but thanks for sharing.

kevin061 · 2025-12-09T19:27:54 1765308474

Lol, someone vibecoded an entire website for OpenAI's model, that's some dedication.

BoorishBears · 2025-12-09T22:55:45 1765320945

People have been doing this for literally every anticipated model release, and I presume skimming some amount of legitimate interest since their sites end up being top indexed until the actual model is released.

Google should be punishing these sites but presumably it's too narrow of a problem for them to care.

kevin061 · 2025-12-09T22:57:43 1765321063

Black SEO in the age of LLMs

dmix · 2025-12-10T01:34:46 1765330486

It would need outbound links to be SEO

Or at least a profit model. I don't see either on that page but maybe I'm missing something

ewoodrich · 2025-12-10T03:18:19 1765336699

Every link in the "Legal" tree is a dead end redirecting back to the home page... strange thing to put together without any acknowledgement, unless they spam it on LLM adjacent subreddits for clout/karma?

ttul · 2025-12-10T01:28:22 1765330102

"GPT, please make me a website about OpenAI's 'Garlic' model."

YetAnotherNick · 2025-12-09T18:45:22 1765305922

No this is comparable to Deepseek-v3.2 even on their highlight task, with significantly worse general ability. And it's priced 5x of that.

esafak · 2025-12-09T19:50:36 1765309836

It's open source; the price is up to the provider, and I do not see any on openrouter yet. ̶G̶i̶v̶e̶n̶ ̶t̶h̶a̶t̶ ̶d̶e̶v̶s̶t̶r̶a̶l̶ ̶i̶s̶ ̶m̶u̶c̶h̶ ̶s̶m̶a̶l̶l̶e̶r̶,̶ ̶I̶ ̶c̶a̶n̶ ̶n̶o̶t̶ ̶i̶m̶a̶g̶i̶n̶e̶ ̶i̶t̶ ̶w̶i̶l̶l̶ ̶b̶e̶ ̶m̶o̶r̶e̶ ̶e̶x̶p̶e̶n̶s̶i̶v̶e̶,̶ ̶l̶e̶t̶ ̶a̶l̶o̶n̶e̶ ̶5̶x̶.̶ ̶I̶f̶ ̶a̶n̶y̶t̶h̶i̶n̶g̶ ̶D̶e̶e̶p̶S̶e̶e̶k̶ ̶w̶i̶l̶l̶ ̶b̶e̶ ̶5̶x̶ ̶t̶h̶e̶ ̶c̶o̶s̶t̶.̶

edit: Mea culpa. I missed the active vs dense difference.

NitpickLawyer · 2025-12-09T20:45:17 1765313117

> Given that devstral is much smaller, I can not imagine it will be more expensive

Devstral 2 is 123B dense. Deepseek is 37B Active. It will be slower and more expensive to run inference on this than dsv3. Especially considering that dsv3.2 has some goodies that make inference at higher context be more effective than their previous gen.

syntaxing · 2025-12-09T22:48:13 1765320493

Devstral is purely nonthinking too it’s very possible it uses less models (I don’t know how DS 3.2 nonthinking compares). It’s interesting because Qwen pretty much proved hybrid models work worse than fully separate models.

aimanbenbaha · 2025-12-09T23:46:50 1765324010

Deepseek v3.2 is that cheap because its attention mechanism is ridiculously efficient.

esafak · 2025-12-10T02:08:08 1765332488

Yeah, DeepSeek Sparse Attention. Section 2: https://arxiv.org/abs/2512.02556

InsideOutSanta · 2025-12-09T19:38:35 1765309115

I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.

It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.

It introduced one new bug, but then fixed it on the first try when I pointed it out.

The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.

It's too early to form a conclusion, but so far, it's looking quite competent.

Staross · 2025-12-10T15:15:19 1765379719

Also tried it on a small project, it did ok finding issues but completely failed doing rather basic edits, like it lost closing brackets or used wrong syntax and couldn't recover. The CLI was easy to setup and use though.

embedding-shape · 2025-12-10T15:41:09 1765381269

Did you try it via OpenRouter? If so, what provider? I've noticed some providers seems to not exactly be upfront about what quantization they're using, you can see that the responses from some providers who supposedly run the exact same model and weights give vastly different responses.

Back when Devstral 1 released, this was made very noticeable to me because the ones who used the smaller quantizations were unable to actually properly format the code, just as you noticed, that's why this sounded so similar to what I've seen before.

MLgulabio · 2025-12-09T19:42:34 1765309354

On what hardware did you run it?

syntaxing · 2025-12-09T22:42:43 1765320163

FWIW, it’s free through Mistral right now

seaal · 2025-12-10T02:31:12 1765333872

and openrouter https://openrouter.ai/mistralai/devstral-2512:free

tamnd · 2025-12-10T12:18:14 1765369094

OpenRouter rate limit is pretty bad, almost unuseable. And they take margin 5.5% on the based models.

freakynit · 2025-12-10T02:57:50 1765335470

So I tested the bigger model with my typical standard test queries which are not so tough, not so easy. They are also some that you wouldn't find extensive training data for. Finally, I already have used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3 ....

Here is what I think about the bigger model: It sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response sped was pretty good.

Overall, I can see myself shifting to this for reguar day-to-day coding if they can offer this for copetitive pricing.

I'll still use sonnet 4.5 or gemini 3 for complex queries, but, for everything else code related, this seems to be pretty good.

Congrats Mistral. You most probably have caught up to the big guys. Not there yet exactly, but, not far now.

embedding-shape · 2025-12-09T16:09:27 1765296567

Look interesting, eager to play around with it! Devstral was a neat model when it released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so gonna be interesting to see if Devstral 2 can replace it.

I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise to realize where models go wrong, but for professional work where you need tight control over the quality, you can obviously not vibe your way to excellency, hard reviews are required, so not "vibe coding" which is all about unreviewed code and just going with whatever the LLM outputs.

But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe coding frenzy. But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on off-handing work to vibe-coding agents, while what I want is something even tighter integrated with my tools so I can continue delivering high quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...

williamstein · 2025-12-09T16:18:15 1765297095

Their new CLI agent tool [1] is written in Python unlike similar agents from Anthropic/Google (Typescript/Bun) and OpenAI (Rust). It also appears to have first class ACP support, where ACP is the new protocol from Zed [2].

[1] https://github.com/mistralai/mistral-vibe

[2] https://zed.dev/acp

esafak · 2025-12-09T16:24:57 1765297497

I did not know A2A had a competitor :(

4b11b4 · 2025-12-09T16:38:55 1765298335

They're different use cases, ACP is for clients (UIs, interfaces)

embedding-shape · 2025-12-09T17:29:23 1765301363

> Their new CLI agent tool [1] is written in

This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.

chrsw · 2025-12-09T18:20:11 1765304411

I think that's just the name they picked. I don't mind it. Taking a glance at what it actually does, it just looks like another command line coding assistant/agent similar to Opencode and friends. You can use it for whatever you want not just "vibe coding", including high quality, serious, professional development. You just have to know what you're doing.

hadlock · 2025-12-09T19:44:28 1765309468

>vibe-coding

A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.

bigiain · 2025-12-10T01:01:05 1765328465

You are right.

But there is nothing more permanent that a quickly hacked together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.

3vidence · 2025-12-10T16:36:46 1765384606

There is a phrase I've heard a number of times in my career that I find relevant here.

"There is nothing more permanent than a temporary demo"

pdntspa · 2025-12-09T16:29:24 1765297764

> But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?

Claude Code not good enough for ya?

embedding-shape · 2025-12-09T17:30:46 1765301446

Claude Code has absolutely zero features that help me review code or do anything else than vibe-coding and accept changes as they come in. We need diff-comparisons between different executions, tailored TUI for that kind of work and more. Claude Code is basically a MVP of that.

Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.

vidarh · 2025-12-09T18:17:30 1765304250

I really do not want those things in Claude COde - I much prefer choosing my own diff tools etc. and running them in a separate terminal. If they start stuffing too much into the TUI they'd ruin it - if you want all that stuff built in, they have the VS Code integration.

ido · 2025-12-10T18:29:54 1765391394

using claude code via the VS Code plugin gives you side by side diffs as it works.

embedding-shape · 2025-12-09T19:09:07 1765307347

Me neither, hence the stated preference for something completely new and different, a stab in the different direction instead of the same boring iteration on yet another agentic TUI coder.

Havoc · 2025-12-10T00:58:27 1765328307

Mind elaborating a bit on the diff tool / flow you’re using? Trying to follow along better with what CC is doing

vidarh · 2025-12-10T14:24:42 1765376682

I don't want/use anything fancy - I just use git diff in a separate terminal. I don't care about the individual changes Claude is making during a unit of work. I'll review a final change. Sometimes not even that - if the tests pass I may way until it's committed a bunch of changes, and review them as a whole.

Trying to follow along better is exactly the opposite of what I'd advocate - it's a waste of time especially with Claude, as Claude tends to favour trying lots of things, seeing what works, and revising its approach multiple times for complex tasks. If you follow along every step, you'll be tearing your hair out over stupid choices that it'll undo within seconds if you just let it work.

Havoc · 2025-12-10T14:39:20 1765377560

That makes sense. Thanks for explaining

jbs789 · 2025-12-10T07:25:17 1765351517

Claude code run in a VS Code terminal window pops up a diff in VSCode before making changes. Not sure if that helps. I do have the Claude Code extension installed too.

I find the flow works bc if it starts going off piste I just end it. Plus I then get my pre-commit hooks etc. I still like being relatively hands on though.

pdntspa · 2025-12-10T04:41:26 1765341686

IntelliJ's AI service as a PR summarizer that I have found very helpful

johnfn · 2025-12-09T18:45:48 1765305948

> Claude Code has absolutely zero features that help me review code

Err, doesn’t it have /review?

victorbjorklund · 2025-12-09T19:15:13 1765307713

What’s wrong with using GIT for reviewing the changes?

embedding-shape · 2025-12-09T21:21:10 1765315270

Are any of them integrated with git? AFAIK, you'd have to instruct them to use git for you if you don't want to do it manually.

Imagine a GUI built around git branches + agents working in those branches + tooling to manage the orchestration and small review points, rather than "here's a chat and tool calling, glhf".

KronisLV · 2025-12-10T10:18:16 1765361896

> Are any of them integrated with git?

All of the models that can do tool calls are typically good enough to use Git.

Just this week I used both Claude Code and Codex to look at unstaged/staged changes and to review them multiple times, even do comparison between a feature branch and the main branch to identify why a particular feature might have broken in the feature branch.

embedding-shape · 2025-12-10T14:19:15 1765376355

> All of the models that can do tool calls are typically good enough to use Git.

But again, it's the "user message > llm reason > llm tool call > tool response > llm reason > llm response" flow I think is inefficient and not good enough. It's a lazy solution built on top of the chat flow.

What I imagined would exist by now would be something smarter, where you don't say "Ok, now please commit this" or whatever.

I already have a tool for myself that launch Codex, Claude Code, Qwen Code(r?) and Gemini for each change I do, and automatically manage them into git branches, and lets me diff between what they do and so on.

Yet I still think we haven't really figured out a good UX for this.

zer0tonin · 2025-12-10T10:54:35 1765364075

Aider is integrated with git

jbellis · 2025-12-09T18:58:46 1765306726

> where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs?

This is what we're building at Brokk: https://brokk.ai/

Quick intro: https://blog.brokk.ai/introducing-lutz-mode/

johanvts · 2025-12-09T16:15:12 1765296912

Did you try Aider?

embedding-shape · 2025-12-09T17:31:31 1765301491

I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.

vidarh · 2025-12-09T18:23:03 1765304583

When I think "IDE but better", a Claude Code-like interface is increasingly what I want.

If you babysit every interaction, rather than reviewing a completed unit of work of some size, you're wasting your time second-guessing that the model won't "recover" from stupid mistakes. Sometimes that's right, but more often than not it corrects itself faster than you can.

And so it's far more effective to interact with it far more async, where the UI is more for figuring out what it did if something doesn't seem right, than for working live. I have Claude writing a game engine in another window right now, while writing this, and I have no interest in reviewing every little change, because I know the finished change will look nothing like the initial draft (it did just start the demo game right now, though, and it's getting there). So I review no smaller units of change than 30m-1h, often it will be hours, sometimes days, between each time I review the output, when working on something well specified.

johanvts · 2025-12-09T18:30:46 1765305046

It has a new “watch files” mode where you can work interactively. You just code normally but can send commands to the llm via a special string. Its a great way if interacting with LLMs, if only they where much faster.

macNchz · 2025-12-09T20:15:04 1765311304

If you're interested in much faster LLM coding, GLM 4.6 on Cerebras is pretty mind blowing. It's not quite as smart as the latest Claude and Gemini, but it generates code so fast it's kind of comical if you're used to the other models. Good with Aider since you can keep it on a tighter leash than with a fully agentic tool.

reachtarunhere · 2025-12-09T18:42:26 1765305746

If your goal is to edit code and not discuss it aider also supports a watch mode. You can keep adding comments about what you want it to do in a minimal format and it will make changes to the files and you can diff/revert them.

zmmmmm · 2025-12-09T21:01:58 1765314118

I think Aider is closest to what you want.

The chat interface is optimal to me because you often are asking questions and seeking guidance or proposals as you are making actual code changes. On reason I do like it is that its default mode of operation is to make a commit for each change it makes. So it is extremely clear what the AI did vs what you did vs what is a hodge podge of both.

As others have mentioned, you can integrate with your IDE through the watch mode. It's somewhat crude but still useful way. But I find myself more often than not just running Aider in a terminal under the code editor window and chatting with it about what's in the window.

embedding-shape · 2025-12-09T21:22:45 1765315365

> I think Aider is closest to what you want.

> The chat interface

Seems very much not, if it's still a chat interface :) Figuring out a chat UX is easy compared to something that was creating with letting LLM fill in some parts from the beginning. I guess I'm searching for something with a different paradigm than just "chat + $Something".

zmmmmm · 2025-12-09T23:11:46 1765321906

the question is, how do you want to provide instructions for what the AI is to do? You might not like calling it "chat" but somehow you need to communicate that, right? With aider you can write a comment for a function and then instruct it to finish the function inline (see other comments). But unless you just want pure autocomplete based on it guessing things, you need to provide guidance to it somehow.

embedding-shape · 2025-12-09T23:18:35 1765322315

I don't know exactly, but I guess in a more declarative manner rather than anything. Maybe we set goals/milestones/concrete objectives, or similar, rather than imperatively steer it, give it space to experiment yet make it very easy to understand exactly what important tradeoffs everything is doing.

It's all very fluffy and theoretical of course.

xmcqdpt2 · 2025-12-10T05:43:42 1765345422

I think the problem is that models are just not that good yet. At least for my usage at work, the CLI tools are the fastest way to get something useful, but if you can't describe basically exactly what you want, you get garbage.

embedding-shape · 2025-12-10T11:59:59 1765367999

They are good enough, but people aren't exploring other UIs enough. The TUI tools (which I think you're referring to, Codex, Claude Code et al) are a good start, but they feel like a prototype compared to a completely different UI. You'd still describe what you want, but not imperative in a chat window, but some other manner.

zmmmmm · 2025-12-10T00:22:33 1765326153

I find a good compromise on that front is not to use the chat primarily, but to create files like 'ARCHITECTURE.md', 'REQUIREMENTS.md' and put information in there describing how the application works. Then you add those to the chat as context docs.From the chat interface then you are just referring to those not just describing features willy nilly. So the nice thing is you are building documentation for the application in a formal sense as part of instructing the LLM.

embedding-shape · 2025-12-10T00:29:13 1765326553

But that is the typical agentic LLM coder style program I was initially referring to, saying we maybe should explore other alternatives to. It's too basic and primitive, with some imagination.

mhast · 2025-12-10T00:45:19 1765327519

The typical "best practice" for these tools tend to be to ask it something like

"I want you to do feature X. Analyse the code for me and make suggestions how to implement this feature."

Then it will go off and work for a while and typically come back after a bit with some suggestions. Then iterate on those if needed and end with.

"Ok. Now take these decided upon ideas and create a plan for how to implement. And create new tests where appropriate."

Then it will go off and come back with a plan for what to do. And then you send it off with.

"Ok, start implementing."

So sure. You probably can work on this to make it easier to use than with a CLI chat. It would likely be less like an IDE and more like a planning tool you'd use with human colleagues though.

troyvit · 2025-12-09T21:31:39 1765315899

Aider can be a chat interface and it's great for that but you can also use it from your editor by telling it to watch your files.[1]

So you'd write a function name and then tell it to flesh it out.

  function factorial(n) // Implement this. AI!

Becomes:

  function factorial(n) {
    if (n === 0 || n === 1) {
      return 1;
    } else {
      return n \* factorial(n - 1);
    }
  }

Last I looked Aider's maintainer has had to focus on other things recently, but aider-ce is a fantastic fork.

I'm really curious to try Mistral's vibe, but even though I'm a big fanboi I don't want to be tied to just one model. Aider lets tier your models such that your big, expensive model can do all the thinking and then stuff like code reviews can run through a smaller model. It's a pretty capable tool

Edit: Fix formatting

[1] https://aider.chat/docs/usage/watch.html

zmmmmm · 2025-12-09T22:02:21 1765317741

> I don't want to be tied to just one model.

Very much this for me - I really don't get why, given a new models are popping out every month from different providers, people are so happy to sink themselves into provider ecosystems when there are open source alternatives that work with any model.

The main problem with Aider is it isn't agentic enough for a lot of people but to me that's a benefit.

andai · 2025-12-09T16:56:45 1765299405

I created a very unprofessional tool, which apparently does what you want!

While True:

0. Context injected automatically. (My repos are small.)

1. I describe a change.

2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)

3. I accept/reject the edit.

htrp · 2025-12-10T14:37:08 1765377428

what's wrong with the current ide tools?

chrsw · 2025-12-09T18:08:20 1765303700

> run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this

What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?

embedding-shape · 2025-12-09T19:11:25 1765307485

RTX Pro 6000, ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory usage efficiency.

kristianp · 2025-12-09T22:06:57 1765318017

That has 96GB GDDR7 ECC, to save people looking it up.

fgonzag · 2025-12-09T19:04:50 1765307090

The model is 64GB (int4 native), add 20GB or so for context.

There are many platforms out there that can run it decently.

AMD strix halo, Mac platforms. Two (or three without extra ram) of the new AMD AI Pro R9700 (32GB of RAM, $1200), multi consumer gpu setups, etc.

FuckButtons · 2025-12-09T21:01:51 1765314111

Mbp 128gb.

true2octave · 2025-12-09T23:10:49 1765321849

High quality code is a thing from the past

What matters is high quality specifications including test cases

embedding-shape · 2025-12-09T23:15:42 1765322142

> High quality code is a thing from the past

Says the person who will find themselves unable to change the software even in the slightest way without having to large refactors across everything at the same time.

High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem, is the second you invite it to continue doing that always.

bigiain · 2025-12-10T01:10:19 1765329019

I dunno...

I have a feeling this will only supercharge the long established industry practice of new devs or engineering leadership getting recruited and immediately criticising the entire existing tech stack, and pushing for (and often succeeding) a ground up rewrite in language/framework de jour. This is hilariously common in web work, particularly front end web work. I suspect there are industry sectors that're well protected from this, I doubt people writing firmware for fuel injection and engine management systems suffer too much from this, the Javascript/Nodejs/NPM scourge _probably_ hasn't hit the PowerPC or 68K embedded device programming workflow. Yet...

bigiain · 2025-12-10T01:15:08 1765329308

"high quality specifications" have _always_ been a thing that matters.

In my mind, it's somewhat orthogonal to code quality.

Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile make specs and code quality somewhat related, but in at least some ways probably drives lower quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.

pluralmonad · 2025-12-09T16:07:50 1765296470

I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.

tormeh · 2025-12-09T16:39:01 1765298341

They're looking for free publicity. "This French company launched a tool that lets you 'vibe' an application into being. Programmers outraged!"

klysm · 2025-12-09T16:09:52 1765296592

Using LLM's to write code is inherently best for unserious work.

dwaltrip · 2025-12-09T16:38:41 1765298321

These are the cutting insights I come to HN for.

neevans · 2025-12-09T17:57:47 1765303067

these are just old senior devs not wanting to accept new changes in the industry.

reyqn · 2025-12-09T21:36:22 1765316182

These are the cutting insights I come to HN for.

freakynit · 2025-12-10T03:10:44 1765336244

"Not reviewing generated code" is the problem. Not the LLM generated code.

jimmydoe · 2025-12-09T16:10:44 1765296644

Maybe they are just trying to be funny.

Eupolemos · 2025-12-10T07:32:22 1765351942

Their chat was called "Le Chat" - it's just their style.

And while it may miss the HN crowd, one of the main selling-points of AI coding is the ease and playfulness.

sofixa · 2025-12-10T13:38:40 1765373920

It's still called "Le Chat" (which means The Cat in French), hence the occasional pun with a cat icon in various places on their website.

Eupolemos · 2025-12-10T14:21:52 1765376512

I didn't know about the cat!

Thanks :)

kilpikaarna · 2025-12-10T05:14:18 1765343658

Agree, but that's just the term for any LLM-assisted development now.

Even the Gemini 3 announcement page had some bit like "best model for vibe coding".

isodev · 2025-12-09T16:40:27 1765298427

If you’re letting Claude write code you’re vibe coding

andai · 2025-12-09T16:59:04 1765299544

So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".

If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)

There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).

HarHarVeryFunny · 2025-12-09T19:57:53 1765310273

Peer coding?

Maybe common usage is shifting, but Karpathy's "vibe coding" was definitely meant to be a never look at the code, just feel the AI vibes thing.

isodev · 2025-12-09T21:27:33 1765315653

I know tech bros like to come up with fancy words to make trivial things sounds fancy but as long as it’s a slop out process, it’s vibe coding. If you’re fixing what a bot spits out, should be a different word … something painful that could’ve been avoided?

Also, we’re both “people in tech”, we know LLMs can’t conceptualise beyond finding the closest collection of tokens rhyming with your prompt/code. Doesn’t mean it’s good or even correct. So that’s why it’s vibe coding.

brazukadev · 2025-12-09T18:13:09 1765303989

> If you're actually making sure it's legit, it's not vibe coding anymore.

sorry to disappoint you but that is also been considered vibecoding. It is just not pejorative.

theLiminator · 2025-12-09T19:14:30 1765307670

Pretty sure Karpathy coined the term here: https://x.com/karpathy/status/1886192184808149383

Imo, if you read the code, it's no longer vibecoding.

NitpickLawyer · 2025-12-09T16:59:22 1765299562

The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.

sunaookami · 2025-12-10T06:22:31 1765347751

No, that's not the definition of "vibe coding". Vibe coding is letting the model do whatever without reviewing it and not understanding the architecture. This was the original definition and still is.

tomashubelbauer · 2025-12-09T18:19:02 1765304342

It sure doesn't feel like it given how closely I have to babysit Claude Code lest I don't recognize the code after Claude Code is done with it when left to its own devices for a minute.

giancarlostoro · 2025-12-09T21:29:06 1765315746

It gets pretty close for me, but I usually tell it how I want it done from the get go.

princehonest · 2025-12-09T18:12:57 1765303977

Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?

clusterhacks · 2025-12-09T19:57:58 1765310278

All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.

I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system. I mean, if I was doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.

For grins:

Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in for $5,000 today and given ram prices, maybe not actually possible tomorrow.

Max CUDA compatibility, slower t/s? DGX Spark.

Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128gb unified memory, order a framework desktop.

Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth speed, mac users seem to be quite happy running locally for just messing around.

You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.

kpw94 · 2025-12-09T21:40:02 1765316402

> I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system.

That's a good idea!

Curious about this, if you don't mind sharing:

- what's the stack ? (Do you run like llama.cpp on that rented machine?)

- what model(s) do you run there?

- what's your rough monthly cost? (Does it come up much cheaper than if you called the equivalent paid APIs)

clusterhacks · 2025-12-09T22:46:12 1765320372

I ran ollama first because it was easy, but now download source and build llama.cpp on the machine. I don't bother saving a file system between runs on the rented machine, I build llama.cpp every time I start up.

I am usually just running gpt-oss-120b or one of the qwen models. Sometimes gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on an single 80-ish gb gpu because those are cheap.

I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.

Juminuvi · 2025-12-10T02:36:34 1765334194

I know you say you don't use the paid apis, but renting a gpu is something I've been thinking about and I'd be really interested in knowing how this compares with paying by the token. I think gpt-oss-120b is 0.10/input 0.60/output per million tokens in azure. In my head this could go a long way but I haven't used gpt oss agentically long enough to really understand usage. Just wondering if you know/be willing to share your typical usage/token spend on that dedicated hardware?

KronisLV · 2025-12-10T10:32:47 1765362767

For comparison, here's my own usage with various cloud models for development:

  * Claude in December: 91 million tokens in, 750k out
  * Codex in December: 43 million tokens in, 351k out
  * Cerebras in December: 41 million tokens in, 301k out
  * (obviously those figures above are so far in the month only)
  * Claude in November: 196 million tokens in, 1.8 million out
  * Codex in November: 214 million tokens in, 4 million out
  * Cerebras in November: 131 million tokens in, 1.6 million out
  * Claude in October: 5 million tokens in, 79k out
  * Codex in October: 119 million tokens in, 3.1 million out

As for Cerebras in October, I don't have the data because they don't show the Qwen3 Coder model that was deprecated, but it was way more: https://blog.kronis.dev/blog/i-blew-through-24-million-token...

In general, I'd say that for the stuff I do my workloads are extremely read heavy (referencing existing code, patterns, tests, build and check script output, implementation plans, docs etc.), but it goes about like this:

  * most fixed cloud subscriptions will run out really quickly and will be insufficient (Cerebras being an exception)
  * if paying per token, you *really* want the provider to support proper caching, otherwise you'll go broke
  * if you have local hardware that is great, but it will *never* compete with the cloud models, so your best bet is to run something good enough, basically cover all of your autocomplete needs, and also with tools like KiloCode an advanced cloud model can do the planning and a simpler local model do the implementation, then the cloud model validate the output

clusterhacks · 2025-12-10T14:32:28 1765377148

Sorry, I don't much track or keep up with those specifics other than knowing I'm not spending much per week. My typical scenario is to spin up an instance that costs less than $2/hr for 2-4 hours. It's all just exploratory work really. Sometimes I'm running a script that is making a call to the LLM server api, other times I'm just noodling around in the web chat interface.

bigiain · 2025-12-10T01:19:21 1765329561

I don't suppose you have (or would be interested in writing) a blog post about how you set that up? Or maybe a list of links/resources/prompts you used to learn how to get there?

clusterhacks · 2025-12-10T02:22:19 1765333339

No, I don't blog. But I just followed the docs for starting an instance on lambda.ai and the llama.cpp build instructions. Both are pretty good resources. I had already setup an SSH key with lambda and the lambda OS images are linux pre-loaded with CUDA libraries on startup.

Here are my lazy notes + a snippet of the history file from the remote instance for a recent setup where I used the web chat interface built into llama.cpp.

I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.

connected from terminal on my box at home and setup the ssh tunnel.

ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>

  Started building llama.cpp from source, history:    
     21  git clone   https://github.com/ggml-org/llama.cpp
     22  cd llama.cpp
     23  which cmake
     24  sudo apt list | grep libcurl
     25  sudo apt-get install libcurl4-openssl-dev
     26  cmake -B build -DGGML_CUDA=ON
     27  cmake --build build --config Release

MISTAKE on 27, SINGLE-THREADED and slow to build see -j 16 below for faster build

     28  cmake --build build --config Release -j 16
     29  ls
     30  ls build
     31  find . -name "llama.server"
     32  find . -name "llama"
     33  ls build/bin/
     34  cd build/bin/
     35  ls
     36  ./llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja

MISTAKE, didn't specify the port number for the llama-server

     37  clear;history
     38  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking -c 0 --jinja --port 11434
     39  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking.gguf -c 0 --jinja --port 11434
     40  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF -c 0 --jinja --port 11434
     41  clear;history

I switched to qwen3 vl because I need a multimodal model for that day's experiment. Lines 38 and 39 show me not using the right name for the model. I like how llama.cpp can download and run models directly off of huggingface.

Then pointed my browser at http//:localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an openai api-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.

bigiain · 2025-12-10T07:17:02 1765351022

Thanks, much appreciated.

tgtweak · 2025-12-09T22:29:07 1765319347

dual 3090's (24GB each) on 8x+8x pcie has been a really reliable setup for me (with nvlink bridge... even though it's relatively low bandwidth compared to tesla nvlink, it's better than going over pcie!)

48GB of vram and lots of cuda cores, hard to beat this value atm.

If you want to go even further, you can get an 8x V100 32GB server complete with 512GB ram and nvlink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908) which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.

lostmsu · 2025-12-09T23:23:18 1765322598

V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards 3 years continuous use are about $12k of electricity).

tgtweak · 2025-12-10T19:25:21 1765394721

Depends where you are plugging them in - but yes they are older gen - despite this, 8xV100 will outperform most of what you can buy for that price simply by way of memory and nvlink bandwidth. If you want to practically run a local model that takes 200GB of memory (Devstral-2-123B-Instruct-2512 for example or GPT-OSS-120B with long context window) without resorting to aggressive ggufs or memory swapping, you don't have many cheaper options. You can also parallelize several models on one node to get some additional throughput for bulk jobs.

monster_truck · 2025-12-09T19:22:37 1765308157

I'd throw a 7900xtx in an AM4 rig with 128gb of ddr4 (which is what I've been using for the past two years)

Fuck nvidia