Pretending a sample of 16 is authoritative is absolutely wild; copium this pure could kill someone.
Also, working on a codebase you already know biases the results from the start -- the participants missed out on what has become a cornerstone of this stuff for AISWE people like me: repo tours. Tree-sitter feeds the codebase to the LLM, and I find everything in the code I care about with either a single well-formatted meta prompt or by just asking questions as I need to.
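Here's roughly what the repo-tour trick looks like. This is a minimal sketch, using Python's stdlib ast module as a stand-in for tree-sitter so it runs with zero dependencies; all the names are mine, not from any particular tool:

```python
# Minimal "repo tour" sketch: walk a repo, pull out top-level symbols,
# and emit a compact map you can paste into a meta prompt. The real
# version uses tree-sitter for multi-language support; stdlib ast
# stands in here so the sketch needs no dependencies.
import ast
from pathlib import Path

KINDS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

def tour(repo_root: str) -> str:
    """Build a compact symbol map of every .py file under repo_root."""
    lines = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser can't handle
        symbols = [n for n in tree.body if isinstance(n, KINDS)]
        if symbols:
            lines.append(str(path))
            lines += [f"  {n.name}  (line {n.lineno})" for n in symbols]
    return "\n".join(lines)

if __name__ == "__main__":
    # Paste the map into the prompt, then ask the model where things live.
    print(tour("."))
```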
I'll concede one thing to the authors of the study: Claude Code is not that great. Everyone I know moved on before July. I'm personally hacking on my own fork of Qwen CLI (itself a fork of Gemini CLI), and it does most of what I want with the models of my choice, which I swap out depending on the task. Sometimes they run locally on my 4090; sometimes I use a frontier or larger open-weights model hosted elsewhere. If you expect a code assistant to drop into your lap and immediately deliver all of its benefits, you'll be disappointed. Nobody can offer that without just prescribing a stack or workflow. You need to make it your own.
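The swapping part is less exotic than it sounds: most local servers (llama.cpp, vLLM, Ollama) expose an OpenAI-compatible endpoint, so moving between the 4090 and a hosted model is one base_url change. A hedged sketch; the URLs and model names below are placeholders, not a prescription:

```python
# Swapping models is one base_url away once everything speaks the
# OpenAI-compatible protocol. URLs, keys, and model names here are
# placeholders for illustration only.
from openai import OpenAI

LOCAL = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
HOSTED = OpenAI(base_url="https://example-provider.com/v1", api_key="...")

def complete(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Local model for quick edits, bigger hosted model for gnarly refactors:
# print(complete(LOCAL, "qwen2.5-coder", "explain this traceback: ..."))
```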
The study amounts to dropping just 16 people into tooling they're unfamiliar with, have no mechanical sympathy for, and aren't likely to shape and mold to their own needs.
You want conclusive evidence? Go make friends with people who hack their own tooling. Basically everyone I hang out with has extended BMAD, written their own agents.md files for specific tasks, and made their own slash commands and "skills" (a convenient name and PR hijacking of a common practice, but whatever, thanks for MCP I guess). Literally, what kind of dev are you if you're not hacking your own tools?
You've got four ingredients to keep in mind when thinking about this stuff: the model, the context, the prompt, and the tooling. If you're not intervening to set up the best combination of each for every workflow, you're letting someone else determine how that workflow goes.
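If it helps make that concrete, here's a hypothetical sketch of pinning the four ingredients per workflow; every name in it is invented for illustration and isn't any real tool's config format:

```python
# Hypothetical per-workflow binding of the four ingredients. All names
# are invented for illustration, not any tool's actual configuration.
from dataclasses import dataclass

@dataclass
class Workflow:
    model: str    # which weights answer
    context: str  # what the model gets to see (repo map, diffs, docs)
    prompt: str   # the system/meta prompt that frames the task
    tooling: str  # what the agent is allowed to run

REVIEW = Workflow(
    model="local-coder-32b",
    context="repo tour + staged diff",
    prompt="prompts/reviewer.md",
    tooling="read-only: grep, tree-sitter queries",
)
MIGRATION = Workflow(
    model="hosted-frontier",
    context="full module + schema docs",
    prompt="prompts/migrator.md",
    tooling="shell + test runner",
)
```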
"Universal function approximators that can speak English got invented, and nobody wants to talk to them" is not the sci-fi future I was hoping for back in 2014, when I was a young NLP practitioner learning Python for the first time and longing for statistical language modeling to lead to code generation.
If you can't make it work, fine, maybe it's not for you, but I would probably turn violent if you tried to take this stuff from me.