We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
Aiden is perhaps misinformed. From a Bing search performed just now:
> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.
Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.
It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y not to 'continue the conversation'. At the very least, a summary of how exactly it pertains. Sell us.
In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"
Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.
I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.
Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?
You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.
Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.
Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).
I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.
Simon has never seemed annoying, so unlike other comments that might worry me (even "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through the chaff, etc.
The specific 'question' is a promise to catch training on more publicly available data, and to expect more blog links copied 'into dozens of different conversations'... Jump for joy. Stop the presses. Oops, snarky again :)
Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one, is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.
Aside: all this concern about over-fitting just reinforces my belief these things won't take over the profession any time soon. Maybe the job.
Having read the followup post being linked, I'm even more confused. Commenting or, really, anything seems even less worthwhile. That's my point.
You brought the benchmark and anticipated their... cheesing, with a promise to catch them at it. Cool announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there. Like everything else anyone wrote.
Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)
Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again
It is SEO-y and I’m sure no small impulse is to drive traffic to his website since he’s primarily an AI influencer.
However, there are always people who are “native” to a platform and field. Pieter Levels is native to Twitter and the nomad community. Swyx is native to Twitter/HN and the devtools community. And simonw is native to at least HN and the LLM-interest community. And various streamers and onlyfans creators do the same with theirs.
Through some degree of releasing things that that community values, they build a relationship that allows them greater freedom in participating there. It does create a positive feedback cycle for them (and hopefully the community) that most of them will try to parlay into something else: Levels and the OnlyFans creators are probably best at this monetization of reputation, but each of them is doing it. One success step for simonw would be "Creator of the Pelican LLM benchmark".
Once you’ve breached some stable point in the community the norms are somewhat relaxed. But it’s not easy to do that. You have to produce some extraordinary volume of things that people value.
I think, tbh, tptacek here could most effectively monetize if he decided to. But he doesn’t appear to want to so he’s just a participant not an influencer so to speak. Whereas someone like Levels or simonw is both.
It’s just creator economy stuff. Meta discussions like this always pop up. But ultimately simonw is past the threshold of trust. There are people who say “wtf? Why is levels making $50k/mo on a stupid vibe-coded flying game?”
It ain’t the game. It’s the following before the game. The resource is the audience.
Hell, I would consider myself graced that simonw, yes, THAT simonw, the LLM whisperer, took time out of his busy schedule to send me to a discussion I might have expressed interest in.
... and my point remains: he's fine. Could be better. If he does grace us, he can choose to bait the hook more effectively. Or not. The stakes are silly-low.
This interaction is, effectively, a link dropped with an announcement of an announcement. For what has already occurred. Over-fitting, training? You don't say.
If I wanted to be more of an ass, I'd look to argue about hype generation. But I don't, and I appreciate any honest effort, which I believe Simon's is.
> We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.
Honestly though, the benchmark was originally meant to be a stupid joke.
I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing, and it appears to be a useful signal for which models are worth diving into in more detail.
Your comment is funny, but please note: it's not drawing a pelican riding a bike, it's describing a pelican riding a bike in SVG. Your candidate would at least display some knowledge of the SVG specs.
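To make that concrete, here's a minimal hand-written sketch of what "describing a pelican riding a bike in SVG" amounts to: text describing vector primitives rather than a rendered image. The shapes and coordinates are made up for illustration and are not any model's actual benchmark output.

```python
# A minimal, hand-written illustration (not any model's actual output):
# an SVG "drawing" is just markup describing primitives like circles and lines.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="150">
  <circle cx="55" cy="115" r="25" fill="none" stroke="black"/>   <!-- rear wheel -->
  <circle cx="150" cy="115" r="25" fill="none" stroke="black"/>  <!-- front wheel -->
  <line x1="55" y1="115" x2="105" y2="75" stroke="black"/>       <!-- frame -->
  <line x1="105" y1="75" x2="150" y2="115" stroke="black"/>
  <ellipse cx="105" cy="60" rx="30" ry="18" fill="white" stroke="black"/>  <!-- body -->
  <polygon points="130,52 172,58 130,66" fill="orange"/>         <!-- beak -->
</svg>"""

# Write the markup to a file so a browser can render it.
with open("pelican.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```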
I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on the pelican-riding-a-bicycle task is a good indicator of performance on other tasks.
The difference is that the worker you hire would be a human being and not a large matrix multiplication that had parameters optimized by a gradient descent process and embeds concepts in a high-dimensional vector space that results in all sorts of weird things like subliminal learning (https://alignment.anthropic.com/2025/subliminal-learning/).
It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?
Also, more directly: the "all sorts of other things" we want LLMs to be good at often involve writing code, spatial reasoning, and world understanding, which creating an SVG of a pelican riding a bicycle very, very directly evaluates, so it's not even that surprising?
For better or worse, a lot of job interviews actually do use contrived questions like this, such as the infamous "how many golf balls can you fit in a 747?"
A posteriori knowledge. The pelican isn't the point, it's just amusing. The point is that Simon has seen a correlation between this skill and the model's general capabilities.
It's just a variant of the wine glass - something that doesn't exist in the source material as-is. I have a few of my own I don't share publicly.
Basically in my niche I _know_ there are no original pictures of specific situations and my prompts test whether the LLM is "creative" enough to combine multiple sources into one that matches my prompt.
I think of it like this: there are three things I want in the picture (more actually, but for the example assume 3). All three are really far from each other in relevance, each at a corner of an equilateral triangle (in the vector space of the LLM's "brain"). What I'm asking it to do is in the middle of all three things.
Every model so far tends to veer towards one or two of the points more than others because it can't figure out how to combine them all into one properly.
> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.
> Yes it's like the wine glass thing.
No, it's not!
That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) the same way for something that is realistic and something that is unrealistic?
> the wine glass scenario is a _realistic_ scenario
It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a wine glass like that.
A pelican riding a bike, on the other hand, is a realistic scenario because of children's TV. Example: a 1950s animation/comic involving a pelican [1].
A better reason why wine glasses are not filled like that is that wine glasses are designed to capture the aroma of the wine.
Since people look at a glass of wine and judge how much "value" they got partly by how full it looks, many bars and restaurants choose wine glasses that are bad for the purpose of enjoying wine: smaller ones that can be filled fuller.
If the thing we're measuring is the ability to write code, visually reason, and extrapolate to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?
That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.
I am not convinced they are executing code. Otherwise I would expect LLMs to not frequently guess the result of math questions.
Of course you could train it. Some quick scripting to find all words with repeated letters, build up sample sentences ("aardvark" has three a's), and you have hard-coded the answer to simple questions that make your LLM look stupid.
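For what it's worth, a minimal sketch of that kind of scripting (the word list and sentence template here are made up for illustration):

```python
from collections import Counter

def repeat_letter_sentences(words):
    """Yield sentences stating how often the most repeated letter
    appears in each word, e.g. for "aardvark": three a's."""
    names = {2: "two", 3: "three", 4: "four", 5: "five"}
    for word in words:
        letter, n = Counter(word.lower()).most_common(1)[0]
        if n >= 2:  # only keep words with a repeated letter
            yield f'The word "{word}" contains {names.get(n, str(n))} "{letter}"s.'

# Tiny hard-coded sample; in practice you'd feed in a real word list.
for s in repeat_letter_sentences(["aardvark", "strawberry", "bookkeeper"]):
    print(s)
```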
I have personally observed Grok running Python code in a chat to determine the current date so it could accurately tell me whether the 20th is a Friday (it wasn't in that specific month)
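Presumably the generated code was something along these lines (a sketch of the check it would need, not Grok's actual tool call):

```python
from datetime import date

# Check whether the 20th of the current month falls on a Friday.
the_20th = date.today().replace(day=20)
print(the_20th.strftime("The 20th of %B %Y is a %A."))
print("Yes, it's a Friday." if the_20th.weekday() == 4 else "No, it isn't a Friday.")
```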
.. it did that in a story prompt that didn't happen in a) our world b) the current time =)
I may have missed something, but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS; there are no technical limitations. So yes, I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is an easy task for a human developer.
The point stands. Whether or not the standard is current has no relevance for the ability of the "AI" to produce the requested content. Either it can or can't.
> Ergo, models for the most part will only have a cursory knowledge of a spec that your browser will never be able to parse because that isn't the spec that won.
Browsers are able to parse a webpage from 1996. I don't know what the argument in the linked comment is about, but in this one, we discuss the relevance of creating a 1996 page vs. a pelican on a bicycle in SVG.
Here is Gemini when asked how to build a webpage from 1996. It seems pretty correct. In general I dislike grand statements that are difficult to back up; in your case, that models have only a cursory knowledge of something (what does that even mean in the context of LLMs?), what exactly they were trained on, etc.
The shortened Gemini answer; you can ask for the detailed version yourself:
- Layout via Tables: Without modern CSS, layouts were created using complex, nested HTML tables and invisible "spacer GIFs" to control white space.
- Framesets: Windows were often split into independent sections (like a static sidebar and a scrolling content window) using Frames.
- Inline Styling: Formatting was not centralized; fonts and colors were hard-coded individually on every element using the <font> tag.
- Low-Bandwidth Design: Visuals relied on tiny tiled background images, animated GIFs, and the limited "Web Safe" color palette.
- CGI & Java: Backend processing was handled by Perl/CGI scripts, while advanced interactivity used slow-loading Java Applets.
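For anyone who never wrote one of these, here's roughly what that table-and-<font> style looks like in practice. This is a made-up page, not Gemini's output; the spacer.gif filename, colours, and text are placeholders.

```python
# Roughly the markup style described above: table layout, a spacer GIF,
# and inline <font> tags instead of CSS. Everything here is illustrative.
page = """<html>
<head><title>My Homepage</title></head>
<body bgcolor="#FFFFFF" text="#000000" link="#0000FF">
<table width="600" border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td width="150" bgcolor="#CCCCCC">
      <font face="Arial" size="2"><b>Navigation</b></font>
    </td>
    <td width="20"><img src="spacer.gif" width="20" height="1"></td>
    <td width="430">
      <font face="Times New Roman" size="3" color="#000080">
        Welcome to my home page!
      </font>
    </td>
  </tr>
</table>
</body>
</html>
"""

with open("index.html", "w") as f:
    f.write(page)
```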
I'd be curious about that actually; I feel like W3C specifications (I don't mean browser support for them) rarely deprecate things and precisely try to keep the Web running.
Yes, now please prepare an email template which renders fine in Outlook using modern web standards. Write it up if you succeed, front page of HN guaranteed!
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)