Agree on the unpredictability of results issue. Tool call selection is still sort of a black box.
How do you know what variations of a prompt trigger a given tool to be called or how many tools is too many before you start seeing degradation issues because of the context window. If you are building a client and not a server the issue becomes even more pronounced.
I even extracted the Claude electron source to see if I could figure out how they were doing it, but it's abstracted behind a network request. I'm guessing the system prompt handles tool call selection.
I'm working on a coding agent, and MCP has been a frequently requested feature, but yeah this issue has been my main hesitation.
Getting even basic prompts that are designed to do one or two things to work reliably requires so much testing and iteration that I'm inherently pretty skeptical that "here are 10 community-contributed MCPs—choose the right one for the task" will have any hope of working reliably. Of course the benefits if it would work are very clear, so I'm keeping a close watch on it. Evals seem like a key piece of the puzzle, though you still might end up in combinatorial explosion territory by trying to test all the potential interactions with multiple MCPs. I could also see it getting very expensive to test this way.
I actually came across Plandex the other day. I haven't had the chance to play around with it yet, but it looked really cool.
But agree that even basic prompts can be a struggle. You often need to name the tool in the prompt to get things to work reliably, but that's an awful user experience. Tool call descriptions play a pretty vital role, but most MCP servers are severely lacking in this regard.
I hope this a result of everything being so new and the tooling and models will evolve to solve these issues over time.
Yeah, I'm still wondering if MCP will be the solution that sticks in the long run.
It has momentum and clearly a lot of folks are working on these shortcomings, so I could certainly see it becoming the de facto standard. But the issues we're talking about are pretty major ones that might need a more fundamental reimagining to address. Although it could also theoretically all be resolved by the models improving sufficiently, so who knows.
Also, cool to hear that you came across Plandex. Lmk what you think if you try it out!
1. Giving the model too many choices. If you have a lot of options (like a bunch of MCP servers) what you often see in practice is that it's like a dice roll which option is chosen, even if the best choice is pretty obvious to a human. This is even tough when you just have a single branch in the prompt where the model has to choose path A or B. It's hard to get it to choose intelligently vs. randomly.
2. Global scope. The prompts related to each MCP all get mixed together in the system prompt, along with the prompting for the tool that's integrating them. They can easily be modifying each other's behavior in unpredictable ways.
Right, my workflow to get even a basic prompt working consistently rarely involves fewer than like 10 cycles of [run it 10 times -> update the prompt extensively to knock out problems in the first step]
And then every time I try to add something new to the prompt, all the prompting for previously existing behavior often needs to be updated as well to account for the new stuff, even if it's in a totally separate 'branch' of the prompt flow/logic.
I'd anticipate that each individual MCP I wanted to add would require a similar process to ensure reliability.
This is pretty cool. You should also attempt to scan resources if possible. Similar to the tool injection attack Invariant Labs discovered, I achieved the same result via resource injection [1].
The three things I want solved to improve local MCP server security are file system access, version pinning, and restricted outbound network access.
I've been running my MCP servers in a Docker container and mounting only the necessary files for the server itself, but this isn't foolproof. I know some others have been experimenting with WASI and Firecracker VMs. I've also been experimenting with setting up a squid proxy in my docker container to restrict outbound access for the MCP servers. All of this being said, it would be nice if there was a standard that was set up to make these things easier.
One of the biggest issues I see, briefly discussed here, is how one MCP server tool's output can affect other tools later in the same message thread. To prevent this, there really needs to be sandboxing between tools. Invariant labs did this with tool descriptions [1], but I also achieved the same via MCP resource attachments[2]. It's a pretty major flaw exacerbated by the type of privilege and systems people are giving MCP servers access to.
This isn't necessarily the fault of the spec itself, but how most clients have implemented it allows for some pretty major prompt injections.
Yes, and the people involved in all this stuff have also reinvented SQL injection in a different way in the prompt interface, since it's impossible[1] for the model to tell what parts of the prompt are trustworthy and what parts are tainted by user input, no matter what delimeters etc you try to use. This is because what the model sees is just a bunch of token numbers. You'd need to change how the encoding and decoding steps work and change how models are trained to introduce something akin to the placeholders that solve the sql injection problem.
Therefore it's possible to prompt inject and tool inject. So you could for example prompt inject to get a model to call your tool which then does an injection to get the user to run some untrustworthy code of your own devising.
Yeah, you aren't far off with SQL injection comparison. That being said it's not really a fault of the MCP spec, more so with current client implementations of it.
I was in the same boat in regards to trying to find the actual JSON that was going over the wire.
I ended up using Charles to capture all the network requests. I haven't finished the post yet, but if you want to see the actual JSON I have all of the request and responses here https://www.catiemcp.com/blog/mcp-transport-layer/
Yeah, I plan on improving the formatting and adding a few more examples. There were even still some typos in the piece. To be honest, I didn't plan on sharing it yet; I just figured it might be helpful for the OP, so I shared it early.
I also think the docs are pretty good. There's just something about seeing the actual network requests that helps clarify things for me.
Oh, that's really nice. Did you capture the responses from the LLM. Presumably it has some kind of special syntax in it to initiate a tool call, described in the prompt? Like TOOL_CALL<mcp=github,command=list> or something…
Every project I do is an assertion that I don't believe the thing I make exists.
I have been unable to find a streaming forward only markdown renderer for the terminal nor have I been able to find any suitable library that I could build one with.
So I've taken on the ambitious effort of building my own parser and renderer and go through all the grueling testing that entails
the answers to that question are hugely variable and depend on the objective and defining waste. if one values learning intrinsically, like most of us here probably do, it is pretty hard to come up with a waste of time, even taking the rare break from learning.
But it seems self-evident where constraints like markets or material conditions might demarcate usefulness and waste.
Even the learners who are as happy to hear about linguistics as they are material science I presume do some opportunity cost analysis as they learn. Personally speaking, I rarely, if ever, feel like I'm wasting time per se but I always recognize and am conscious of the other things I could be doing to better maximize alternative objectives. That omnipresent consciousness may just be anxiety though I guess...
Yeah, at its core it's just a proxy, so there are a lot of other tools out there that would do the job. It does have a nice UI and I try to support projects like it when I can.
I'll check out your proxy as well, I enjoy looking at anything built around networking.
Oh yeah that's a great idea - just set it up once in minikube and use it just like you were debugging something in your production k8s stack. As much as I love k9s, sometimes I just need a trace to debug something between services
I'm personally really excited about all of the recent tooling for postgres aggregates. Definitely a pain point for a lot of developers and its easy to fall in trap where things work fine in the beginning and then query times explode as requirements change and the dataset grows. Nice to not have to spin up another DB in order to solve the problem as well.
For those who have not experimented with columnar based databases, I would highly recommend toying around with them.
The performance improvements can be substantial. Obviously there are drawbacks involved with integrating a new database into your infrastructure, so it is exciting to see columnar format introduced to Postgres. Removes the hurdle of learning, deploying and monitoring another database.
Thanks for the advice, that's actually what I'm trying to do now. I'm offering to build free mobile websites so that I can build up a portfolio for myself. Just out of curiosity how many examples of work would you be interested in seeing before you felt comfortable purchasing.
How do you know what variations of a prompt trigger a given tool to be called or how many tools is too many before you start seeing degradation issues because of the context window. If you are building a client and not a server the issue becomes even more pronounced.
I even extracted the Claude electron source to see if I could figure out how they were doing it, but it's abstracted behind a network request. I'm guessing the system prompt handles tool call selection.
PS: I released an open source evals package if you're curious. Still a WIP, but does the basics https://github.com/mclenhard/mcp-evals