What we built
An open, continuously updated LLM Penetration-Testing Leaderboard.
Run 001 pits 8 models against a deliberately vulnerable Express.js app.
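The target app itself isn't reproduced here. As a rough, hypothetical illustration of the class of bugs involved, a deliberately vulnerable Express route might look like the sketch below (route, table, and column names are invented, not taken from the Run 001 app):

```typescript
// Illustrative only: a deliberately vulnerable Express route of the kind such
// a benchmark app might contain. Route, table, and column names are invented.
import express from "express";
import sqlite3 from "sqlite3";

const app = express();
const db = new sqlite3.Database(":memory:");
db.run("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)");

// SQL injection: user input is concatenated straight into the query string.
app.get("/users/search", (req, res) => {
  const name = String(req.query.name ?? "");
  const query = `SELECT id, name, email FROM users WHERE name = '${name}'`;
  db.all(query, (err, rows) => {
    if (err) return res.status(500).json({ error: err.message });
    // A request like /users/search?name=' OR '1'='1 dumps the whole table.
    res.json(rows);
  });
});

app.listen(3000);
```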
Headline results
• Gemini 2.5 Pro (safety-off) found all 9 critical/high vulns.
• Qwen3-30B-a3b-mlx (open source, local on a 2019 MacBook Pro) caught 7/9 with $0 API spend.
• GPT-4o and Claude Opus produced the most polished write-ups but each missed one bug.
Scope (v1)
This first pass measures static bug-hunting skill—think SCA/OWASP Top 10.
Next up: we’ll score exploit writing and automatic PoC execution, so the models must prove they can go from finding a flaw to weaponizing it.
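The harness for that next phase isn't described here; purely as a hedged sketch under invented names, "automatic PoC execution" scoring could mean running a model-generated exploit against the target and awarding credit only when a verifiable side effect shows up:

```typescript
// Hypothetical sketch: execute a model-generated PoC against the target app
// and score it by checking for a concrete, verifiable effect. Nothing here
// reflects the leaderboard's actual harness; all names are invented.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface PoCResult {
  vulnId: string;
  executed: boolean;
  exploited: boolean;
}

async function scorePoC(vulnId: string, pocPath: string): Promise<PoCResult> {
  try {
    // A real setup would run the PoC in an isolated container/network
    // namespace; it is invoked directly here for brevity.
    const { stdout } = await run("node", [pocPath], { timeout: 30_000 });
    // Award the point only if the PoC prints a canary value it could only
    // have obtained by actually exploiting the flaw (e.g., a secret seeded
    // into the target's database).
    const exploited = stdout.includes("CANARY-" + vulnId);
    return { vulnId, executed: true, exploited };
  } catch {
    return { vulnId, executed: false, exploited: false };
  }
}
```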
I’ve been using CommonPaper since they first put up the concept, and it’s all sorts of awesome: great in principle and very convenient for my boutique security consulting business.
An absolute no-brainer for us to use what they’ve built. It was instrumental in getting us off the ground.
Interesting thought. It would be complicated to package this into something you purchase once. Props integrates with Slack, Salesforce, Zendesk, etc.
Maybe we are not communicating the product well on the site.
Oh I'm sorry, I forgot that provider APIs are magical, mythical beasts that must be tamed from an intermediary server, never from a client.
It's OK to say "we don't think enough people will pay enough for a one-time purchase to cover development costs". You don't have to invent some kind of technical reason for it to be a service.
But having said that, I cannot believe that companies would pay the amount you're asking for a service either.
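For what it's worth, the technical point in the comment above is easy to demonstrate: a client can call a provider API directly. Assuming an OpenAI-style chat completions endpoint, a minimal sketch looks like this; the practical reason for an intermediary server is key management, not capability:

```typescript
// Minimal sketch of a direct client-side provider call (OpenAI-style endpoint
// assumed). This works technically; the catch is that the API key is visible
// to anyone inspecting the client, which is why intermediary servers exist.
async function askModel(prompt: string, apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```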
It's about recognizing people doing great things in your org. It has really become part of the culture at the companies that have been using it (and paying) for months now.
I'm not sure I follow. I'm currently only looking at the value of the content attribute. Are you saying that 'name=...' specifically should work, or that it should respect any attribute name?