> I know this is an anecdote, but try to break down the problem you have in simpler terms and it may work.
This is an expected outcome of how LLMs handle large problems. One of the "scaling" results is that the probability of success drops off as problem size / length / duration grows (hence headlines like "AI can now automate tasks that take humans [1 hour/etc]").
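A toy model makes that scaling intuition concrete. Every number here is an assumption for illustration, not a measurement of any particular model:

```python
# Toy independence model, illustrative numbers only: if each "step" of a task
# succeeds independently with probability p, a task that needs n consecutive
# correct steps succeeds with probability ~p**n, which decays exponentially
# with task length.
p = 0.99  # assumed per-step success rate (made up for illustration)
for n in (10, 100, 1000):
    print(f"{n:>4} steps -> one-shot success ~ {p**n:.3g}")
# 10 steps -> 0.904, 100 steps -> 0.366, 1000 steps -> 4.32e-05
```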
If the problem is broken down, however, then it's no longer a single problem but a series of sub-problems. If:
* The acceptance criteria are robust, so that success or failure can be reliably and automatically determined by the model itself,
* The specification is correct, in that the full system will work as-designed if the sub-parts are individually correct, and
* The parts are reasonably independent, so that complete components can be treated as a 'black box', without implementation detail polluting the model's context,
... then one can observe a much higher overall success rate by taking repeated high-probability shots (on small problems) rather than long-odds one-shots.
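Under the same toy assumptions (independent sub-problems, failures that can be detected automatically so retries are cheap), a rough sketch of why the repeated small shots come out ahead:

```python
# Same toy model: split an n-step task into k sub-problems of n//k steps each,
# allowing up to r retries per sub-problem (only meaningful if failures can be
# detected automatically, per the acceptance-criteria point above).
def one_shot(p: float, n: int) -> float:
    return p ** n

def decomposed(p: float, n: int, k: int, r: int) -> float:
    sub = p ** (n // k)                    # one attempt at one sub-problem
    sub_with_retries = 1 - (1 - sub) ** r  # at least one of r attempts passes
    return sub_with_retries ** k           # every sub-problem must eventually pass

p, n = 0.99, 1000
print(f"one big shot:            ~{one_shot(p, n):.2g}")           # ~4.3e-05
print(f"50 pieces, 5 tries each: ~{decomposed(p, n, 50, 5):.2g}")  # ~0.99
```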
To be fair, this same basic intuition is also true for humans, but the boundaries are a lot fuzzier because we have genuine long-term memory and a lifetime of experience with conceptual chunking. Nobody is keeping a million-line codebase in their working memory.