We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
Aiden is perhaps misinformed. From a Bing search performed just now:
> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.
Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.
It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y not to 'continue the conversation'. At the very least, a summary of how exactly it pertains. Sell us.
In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"
Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.
I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.
Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?
You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.
Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.
Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).
I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.
Simon has never seemed annoying, so unlike other comments that might worry me (even "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through the chaff, etc.
The specific 'question' is a promise to catch training on more publicly available data, and to expect more blog links copied 'into dozens of different conversations'... Jump for joy. Stop the presses. Oops, snarky again :)
Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one, is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.
Aside: all this concern about over-fitting just reinforces my belief these things won't take over the profession any time soon. Maybe the job.
Having read the followup post being linked, I'm even more confused. Commenting or, really, anything seems even less worthwhile. That's my point.
You brought the benchmark and anticipated their... cheesing, with a promise to catch them at it. Cool announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there. Like everything else anyone wrote.
Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)
Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again
It is SEO-y and I’m sure no small impulse is to drive traffic to his website since he’s primarily an AI influencer.
However, there are always people who are “native” to a platform and field. Pieter Levels is native to Twitter and the nomad community. Swyx is native to Twitter/HN and the devtools community. And simonw is native to at least HN and the LLM-interest community. And various streamers and onlyfans creators do the same with theirs.
Through some degree of releasing things that that community values, they build a relationship that allows them greater freedom in participating there. It does create a positive feedback cycle for them (and hopefully the community) that most of them will try to parlay into something else: Levels and the OnlyFans creators are probably best at this monetization of reputation, but each of them is doing it. One success step for simonw would be "Creator of the Pelican LLM benchmark".
Once you’ve breached some stable point in the community the norms are somewhat relaxed. But it’s not easy to do that. You have to produce some extraordinary volume of things that people value.
I think, tbh, tptacek here could most effectively monetize if he decided to. But he doesn’t appear to want to so he’s just a participant not an influencer so to speak. Whereas someone like Levels or simonw is both.
It’s just creator economy stuff. Meta discussions like this always pop up. But ultimately simonw is past the threshold of trust. There are people who say “wtf? Why is levels making $50k/mo on a stupid vibe-coded flying game?”
It ain’t the game. It’s the following before the game. The resource is the audience.
Hell, I would consider myself graced that simonw, yes, THAT simonw, the LLM whisperer, took time out of his busy schedule to send me to a discussion I might have expressed interest in.
... and my point remains: he's fine. Could be better. If he does grace us, he can choose to bait the hook more effectively. Or not. The stakes are silly-low.
This interaction is, effectively, a link dropped with an announcement of an announcement. For what has already occurred. Over-fitting, training? You don't say.
If I wanted to be more of an ass, I'd look to argue about hype generation. But I don't, and I appreciate any honest effort, which I believe Simon's is.
> We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.
Honestly though, the benchmark was originally meant to be a stupid joke.
I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing, and it appears to be a useful signal for which models are worth diving into in more detail.
Your comment is funny, but please note: it's not drawing a pelican riding a bike, it's describing a pelican riding a bike in SVG. Your candidate would at least display some knowledge of the SVG specs.
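To make that concrete, here's a minimal hand-written sketch of what "describing a pelican riding a bike in SVG" amounts to: text describing vector primitives rather than a rendered image. The shapes and coordinates are made up for illustration and are not any model's actual benchmark output.

```python
# A minimal, hand-written illustration (not any model's actual output):
# an SVG "drawing" is just markup describing primitives like circles and lines.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="150">
  <circle cx="55" cy="115" r="25" fill="none" stroke="black"/>   <!-- rear wheel -->
  <circle cx="150" cy="115" r="25" fill="none" stroke="black"/>  <!-- front wheel -->
  <line x1="55" y1="115" x2="105" y2="75" stroke="black"/>       <!-- frame -->
  <line x1="105" y1="75" x2="150" y2="115" stroke="black"/>
  <ellipse cx="105" cy="60" rx="30" ry="18" fill="white" stroke="black"/>  <!-- body -->
  <polygon points="130,52 172,58 130,66" fill="orange"/>         <!-- beak -->
</svg>"""

# Write the markup to a file so a browser can render it.
with open("pelican.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```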
I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on the pelican-riding-a-bicycle task is a good indicator of performance on other tasks.
The difference is that the worker you hire would be a human being and not a large matrix multiplication that had parameters optimized by a gradient descent process and embeds concepts in a high-dimensional vector space that results in all sorts of weird things like subliminal learning (https://alignment.anthropic.com/2025/subliminal-learning/).
It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?
Also, more directly: the "all sorts of other things" we want LLMs to be good at often involve writing code, spatial reasoning, and world understanding, which creating an SVG of a pelican riding a bicycle very, very directly evaluates, so it's not even that surprising?
For better or worse, a lot of job interviews actually do use contrived questions like this, such as the infamous "how many golf balls can you fit in a 747?"
A posteriori knowledge. The pelican isn't the point, it's just amusing. The point is that Simon has seen a correlation between this skill and the model's general capabilities.
It's just a variant of the wine glass - something that doesn't exist in the source material as-is. I have a few of my own I don't share publicly.
Basically in my niche I _know_ there are no original pictures of specific situations and my prompts test whether the LLM is "creative" enough to combine multiple sources into one that matches my prompt.
I think of it like this: there are three things I want in the picture (more actually, but for the example assume 3). All three are really far from each other in relevance, each at a corner of an equilateral triangle (in the vector space of the LLM's "brain"). What I'm asking it to do is in the middle of all three things.
Every model so far tends to veer towards one or two of the points more than others because it can't figure out how to combine them all into one properly.
> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.
> Yes it's like the wine glass thing.
No, it's not!
That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) the same way for something that is realistic and something that is unrealistic?
> the wine glass scenario is a _realistic_ scenario
It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a wine glass like that.
A pelican riding a bike, on the other hand, is a realistic scenario because of children's TV. Example: a 1950s animation/comic involving a pelican [1].
A better reason why wine glasses are not filled like that is that wine glasses are designed to capture the aroma of the wine.
Since people look at a glass of wine and judge how much "value" they got partly by how full it looks, many bars and restaurants choose wine glasses that are bad for the purpose of enjoying wine: smaller ones that can be filled fuller.
If the thing we're measuring is the ability to write code, visually reason, and extrapolate to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?
That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.
I am not convinced they are executing code. Otherwise I would expect LLMs to not frequently guess the result of math questions.
Of course you could train it. Some quick scripting to find all words with repeated letters, build up sample sentences ("aardvark" has three a's), and you have hard-coded the answer to simple questions that make your LLM look stupid.
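For what it's worth, a minimal sketch of that kind of scripting (the word list and sentence template here are made up for illustration):

```python
from collections import Counter

def repeat_letter_sentences(words):
    """Yield sentences stating how often the most repeated letter
    appears in each word, e.g. for "aardvark": three a's."""
    names = {2: "two", 3: "three", 4: "four", 5: "five"}
    for word in words:
        letter, n = Counter(word.lower()).most_common(1)[0]
        if n >= 2:  # only keep words with a repeated letter
            yield f'The word "{word}" contains {names.get(n, str(n))} "{letter}"s.'

# Tiny hard-coded sample; in practice you'd feed in a real word list.
for s in repeat_letter_sentences(["aardvark", "strawberry", "bookkeeper"]):
    print(s)
```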
I have personally observed Grok running Python code in a chat to determine the current date so it could accurately tell me whether the 20th is a Friday (it wasn't in that specific month)
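Presumably the generated code was something along these lines (a sketch of the check it would need, not Grok's actual tool call):

```python
from datetime import date

# Check whether the 20th of the current month falls on a Friday.
the_20th = date.today().replace(day=20)
print(the_20th.strftime("The 20th of %B %Y is a %A."))
print("Yes, it's a Friday." if the_20th.weekday() == 4 else "No, it isn't a Friday.")
```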
.. it did that in a story prompt that didn't happen in a) our world b) the current time =)
I may have missed something, but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS; there are no technical limitations. So yes, I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is an easy task for a human developer.
The point stands. Whether or not the standard is current has no relevance for the ability of the "AI" to produce the requested content. Either it can or can't.
> Ergo, models for the most part will only have a cursory knowledge of a spec that your browser will never be able to parse because that isn't the spec that won.
Browsers are able to parse a webpage from 1996. I don't know what the argument in the linked comment is about, but in this one, we discuss the relevance of creating a 1996 page vs. a pelican on a bicycle in SVG.
Here is Gemini when asked how to build a webpage from 1996. It seems pretty correct. In general I dislike grand statements that are difficult to back up; in your case, that models have only a cursory knowledge of something (what does that even mean in the context of LLMs?), what exactly they were trained on, etc.
The shortened Gemini answer; you can ask for the detailed version yourself:
- Layout via Tables: Without modern CSS, layouts were created using complex, nested HTML tables and invisible "spacer GIFs" to control white space.
- Framesets: Windows were often split into independent sections (like a static sidebar and a scrolling content window) using Frames.
- Inline Styling: Formatting was not centralized; fonts and colors were hard-coded individually on every element using the <font> tag.
- Low-Bandwidth Design: Visuals relied on tiny tiled background images, animated GIFs, and the limited "Web Safe" color palette.
- CGI & Java: Backend processing was handled by Perl/CGI scripts, while advanced interactivity used slow-loading Java Applets.
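For anyone who never wrote one of these, here's roughly what that table-and-<font> style looks like in practice. This is a made-up page, not Gemini's output; the spacer.gif filename, colours, and text are placeholders.

```python
# Roughly the markup style described above: table layout, a spacer GIF,
# and inline <font> tags instead of CSS. Everything here is illustrative.
page = """<html>
<head><title>My Homepage</title></head>
<body bgcolor="#FFFFFF" text="#000000" link="#0000FF">
<table width="600" border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td width="150" bgcolor="#CCCCCC">
      <font face="Arial" size="2"><b>Navigation</b></font>
    </td>
    <td width="20"><img src="spacer.gif" width="20" height="1"></td>
    <td width="430">
      <font face="Times New Roman" size="3" color="#000080">
        Welcome to my home page!
      </font>
    </td>
  </tr>
</table>
</body>
</html>
"""

with open("index.html", "w") as f:
    f.write(page)
```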
I'd be curious about that actually; I feel like W3C specifications (I don't mean browser support for them) rarely deprecate things and precisely try to keep the Web running.
Yes, now please prepare an email template which renders fine in Outlook using modern web standards. Write it up if you succeed, front page of HN guaranteed!
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)