I've been building Modulo AI for the past year - an AI system that fixes GitHub issues.
Early versions took 5+ minutes to analyze a single issue.
After months of optimization, we're now sub-60 seconds with better accuracy. This presentation covers what we learned about the performance characteristics of production LLM systems that nobody talks about:
- Strategies for faster token throughput.
- Strategies for quick time to first token.
- Effective context window management.
- Model routing strategies.
If you're interested in building AI agents, I'm sure you'll find some interesting insights in it!
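For anyone who wants to separate the first two items when profiling their own agent, here is a minimal sketch of how the two numbers can be measured from any token stream. This is not Modulo's code; `measure_stream` and `StreamStats` are names made up for illustration, and the stream is assumed to be a plain Python iterable of tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamStats:
    time_to_first_token: float  # seconds from request start to the first token
    tokens_per_second: float    # decode throughput measured after the first token
    total_tokens: int


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamStats:
    """Consume a token stream and record latency metrics (illustrative helper)."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last  # timestamp of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    # Throughput only counts the tokens that arrived after the first one,
    # so a slow prefill does not distort the decode rate.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return StreamStats(first - start, tps, count)
```

Keeping the two metrics apart matters because they tend to respond to different fixes: a long prompt usually hurts time to first token, while a long answer mostly stresses throughput.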
Perhaps when LLMs introduce a lot more primitives for modifying behavior, such a programming language would be necessary.
As it stands, anyone working with LLMs knows that most of the work happens before and after the LLM call: making REST calls, saving to a database, etc. Conventional programming languages work well for that purpose.
Personally, I like JSON when the data is not too large. It's easy to read (since it is hierarchical, like most declarative formats) and easy to parse.
One pain point such a PL could address is encoding tribal knowledge about optimal prompting strategies for various LLMs, which changes with each new model release.
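Short of a new PL, that knowledge can at least live in one place in a conventional language. A rough sketch of what I mean, where the model names (`model-a`, `model-b`), the settings, and every function name are hypothetical placeholders rather than real recommendations for any model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptStrategy:
    system_template: str
    use_xml_tags: bool = False   # wrap inputs in XML-style tags
    few_shot_examples: int = 0   # how many examples to prepend (unused in this sketch)


# Hypothetical entries: names and settings are illustrative only.
STRATEGIES = {
    "model-a": PromptStrategy("You are a careful code reviewer.", use_xml_tags=True),
    "model-b": PromptStrategy("Fix the bug described below.", few_shot_examples=2),
}
DEFAULT = PromptStrategy("You are a helpful assistant.")


def strategy_for(model: str) -> PromptStrategy:
    # Longest-prefix match, so "model-a-2025-01" resolves to the "model-a" entry.
    matches = [key for key in STRATEGIES if model.startswith(key)]
    return STRATEGIES[max(matches, key=len)] if matches else DEFAULT


def build_prompt(model: str, issue: str) -> str:
    strategy = strategy_for(model)
    body = f"<issue>\n{issue}\n</issue>" if strategy.use_xml_tags else issue
    return f"{strategy.system_template}\n\n{body}"
```

When a new model ships, only the table changes and the call sites stay the same.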
I would be more interested in Qodo's performance on the SWE-bench Multilingual benchmark. SWE-bench Verified only includes bugs from Python repositories.
The best submission on SWE-bench Multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.
Does anyone have a benchmark comparing embeddings against extensive grepping for mapping bug reports to code files? Qodo, Cursor, and a number of other tools I use rely on grepping to localize faults.
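To make the question concrete, this is roughly the harness I have in mind: score each localizer by recall@k against the files that the real fix touched. It is only a sketch. `REPO`, `grep_localizer`, and the single test case are toy stand-ins I made up, and an embedding-based localizer would be plugged in as another entry with the same signature.

```python
from typing import Callable, Dict, List, Set, Tuple

# A localizer maps a bug report to a ranked list of file paths.
Localizer = Callable[[str], List[str]]


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the truly changed files that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def evaluate(localizers: Dict[str, Localizer],
             cases: List[Tuple[str, Set[str]]], k: int = 5) -> Dict[str, float]:
    """Mean recall@k per localizer over (bug_report, gold_files) cases."""
    if not cases:
        raise ValueError("need at least one case")
    scores = {}
    for name, localize in localizers.items():
        per_case = [recall_at_k(localize(report), gold, k) for report, gold in cases]
        scores[name] = sum(per_case) / len(per_case)
    return scores


# Toy repository contents (made up for illustration).
REPO = {
    "auth/login.py": "def login(user, password): check_password(user, password)",
    "db/session.py": "def open_session(url): connect(url)",
}


def grep_localizer(report: str) -> List[str]:
    """Rank files by how many words of the report occur in their text."""
    words = set(report.lower().split())
    hits = {path: sum(w in text.lower() for w in words) for path, text in REPO.items()}
    return sorted(hits, key=hits.get, reverse=True)
```

Running `evaluate({"grep": grep_localizer, "embeddings": my_embedding_localizer}, cases)` over issues with known fix commits would give a like-for-like comparison, where `my_embedding_localizer` is whatever embedding search you want to test.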
Install and try out our GitHub application: https://github.com/apps/solve-bug
Try Modulo in the browser at: https://moduloware.ai
Here are the code examples for the presentation: https://github.com/kirtivr/pydelhi-talk
What performance issues have you been seeing in your AI agents? And how did you tackle them?