But did you actually give the agent access to a tool to measure code coverage?
If it can't measure whether it is succeeding in increasing coverage, it's no wonder it doesn't do a great job of increasing it.
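
For example, you might expose a coverage check as a callable tool. Here's a minimal sketch using coverage.py (the coverage commands are real; how you register the function as a tool depends entirely on your agent framework, so that part is omitted):

    import json
    import subprocess

    def measure_coverage() -> float:
        """Run the test suite under coverage.py and return percent covered."""
        # Run the tests under coverage; a failing test shouldn't abort the measurement.
        subprocess.run(["coverage", "run", "-m", "pytest"], check=False)
        # Emit a machine-readable report (writes coverage.json).
        subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
        with open("coverage.json") as f:
            report = json.load(f)
        return report["totals"]["percent_covered"]

With something like this in its tool list, the agent can measure coverage before and after each batch of new tests and keep iterating until the number actually goes up.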
Also, it can help to have a pair of agents (which could even be just two instances of the same agent with different prompting): one to write tests, and one to review them. The test-writing agent writes tests and submits them as a PR; the PR-reviewing agent reads the PR and provides feedback; the test-writing agent updates the tests in response to the feedback; iterate until the PR-reviewing agent is satisfied. This can produce much better tests than an agent writing tests without any automated review step.
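
Stripped of the PR machinery, the loop is roughly this (a sketch, not any particular framework's API; `ask` stands in for whatever prompt-in/text-out LLM call you have, with the two roles distinguished only by prompting):

    from typing import Callable

    def write_and_review_tests(
        ask: Callable[[str], str],  # prompt in, completion out
        code_under_test: str,
        max_rounds: int = 5,
    ) -> str:
        """Writer/reviewer loop: iterate until the reviewer approves or we give up."""
        tests = ask(f"Write thorough tests for this code:\n{code_under_test}")
        for _ in range(max_rounds):
            feedback = ask(
                "Review these tests for correctness and coverage. "
                "Reply with exactly APPROVED if they are good; otherwise give feedback.\n"
                f"Code:\n{code_under_test}\n\nTests:\n{tests}"
            )
            if feedback.strip() == "APPROVED":
                break
            tests = ask(
                f"Revise the tests to address this review feedback:\n{feedback}\n\n"
                f"Current tests:\n{tests}"
            )
        return tests

The cap on rounds matters: without it, a reviewer that is never satisfied will loop forever.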