Hacker Newsnew | past | comments | ask | show | jobs | submit | mlenhard's commentslogin

Agree on the unpredictability of results issue. Tool call selection is still sort of a black box.

How do you know what variations of a prompt trigger a given tool to be called or how many tools is too many before you start seeing degradation issues because of the context window. If you are building a client and not a server the issue becomes even more pronounced.

I even extracted the Claude electron source to see if I could figure out how they were doing it, but it's abstracted behind a network request. I'm guessing the system prompt handles tool call selection.

PS: I released an open source evals package if you're curious. Still a WIP, but does the basics https://github.com/mclenhard/mcp-evals


Thanks, I'll check it out.

I'm working on a coding agent, and MCP has been a frequently requested feature, but yeah this issue has been my main hesitation.

Getting even basic prompts that are designed to do one or two things to work reliably requires so much testing and iteration that I'm inherently pretty skeptical that "here are 10 community-contributed MCPs—choose the right one for the task" will have any hope of working reliably. Of course the benefits if it would work are very clear, so I'm keeping a close watch on it. Evals seem like a key piece of the puzzle, though you still might end up in combinatorial explosion territory by trying to test all the potential interactions with multiple MCPs. I could also see it getting very expensive to test this way.


I actually came across Plandex the other day. I haven't had the chance to play around with it yet, but it looked really cool.

But agree that even basic prompts can be a struggle. You often need to name the tool in the prompt to get things to work reliably, but that's an awful user experience. Tool call descriptions play a pretty vital role, but most MCP servers are severely lacking in this regard.

I hope this a result of everything being so new and the tooling and models will evolve to solve these issues over time.


Yeah, I'm still wondering if MCP will be the solution that sticks in the long run.

It has momentum and clearly a lot of folks are working on these shortcomings, so I could certainly see it becoming the de facto standard. But the issues we're talking about are pretty major ones that might need a more fundamental reimagining to address. Although it could also theoretically all be resolved by the models improving sufficiently, so who knows.

Also, cool to hear that you came across Plandex. Lmk what you think if you try it out!


Yes I agree with you. What are the major shortcomings that you can think of right now (especially the ones that we have not solved)?


I think the two biggest issues are probably:

1. Giving the model too many choices. If you have a lot of options (like a bunch of MCP servers) what you often see in practice is that it's like a dice roll which option is chosen, even if the best choice is pretty obvious to a human. This is even tough when you just have a single branch in the prompt where the model has to choose path A or B. It's hard to get it to choose intelligently vs. randomly.

2. Global scope. The prompts related to each MCP all get mixed together in the system prompt, along with the prompting for the tool that's integrating them. They can easily be modifying each other's behavior in unpredictable ways.


Makes sense. Both are hard problems I agree.


Even with proper tool call descriptions, I've had quite a few occasions where the LLM didn't know how to use the tool.

The tools provided by the MCP server were definitely in context and there were only two or three servers with a small amount of tools enabled.

It feels too model dependant at the moment, this was Gemini 2.5 Pro which is normally state of the art but has lots of quirks for tool use it seems.

Agreed on hoping models are going to be trained to be better at using MCP.


Right, my workflow to get even a basic prompt working consistently rarely involves fewer than like 10 cycles of [run it 10 times -> update the prompt extensively to knock out problems in the first step]

And then every time I try to add something new to the prompt, all the prompting for previously existing behavior often needs to be updated as well to account for the new stuff, even if it's in a totally separate 'branch' of the prompt flow/logic.

I'd anticipate that each individual MCP I wanted to add would require a similar process to ensure reliability.


This is pretty cool. You should also attempt to scan resources if possible. Similar to the tool injection attack Invariant Labs discovered, I achieved the same result via resource injection [1].

The three things I want solved to improve local MCP server security are file system access, version pinning, and restricted outbound network access.

I've been running my MCP servers in a Docker container and mounting only the necessary files for the server itself, but this isn't foolproof. I know some others have been experimenting with WASI and Firecracker VMs. I've also been experimenting with setting up a squid proxy in my docker container to restrict outbound access for the MCP servers. All of this being said, it would be nice if there was a standard that was set up to make these things easier.

[1] https://www.bernardiq.com/blog/resource-poisoning/


One of the biggest issues I see, briefly discussed here, is how one MCP server tool's output can affect other tools later in the same message thread. To prevent this, there really needs to be sandboxing between tools. Invariant labs did this with tool descriptions [1], but I also achieved the same via MCP resource attachments[2]. It's a pretty major flaw exacerbated by the type of privilege and systems people are giving MCP servers access to.

This isn't necessarily the fault of the spec itself, but how most clients have implemented it allows for some pretty major prompt injections.

[1] https://invariantlabs.ai/blog/mcp-security-notification-tool... [2] https://www.bernardiq.com/blog/resource-poisoning/


Isn't this basically a lot of hand waving that ends up being isomorphic to SQL injection?

Thats what we're talking about? A bunch of systems cobbled together where one could SQL inject at any point and there's basically zero observability?


Yes, and the people involved in all this stuff have also reinvented SQL injection in a different way in the prompt interface, since it's impossible[1] for the model to tell what parts of the prompt are trustworthy and what parts are tainted by user input, no matter what delimeters etc you try to use. This is because what the model sees is just a bunch of token numbers. You'd need to change how the encoding and decoding steps work and change how models are trained to introduce something akin to the placeholders that solve the sql injection problem.

Therefore it's possible to prompt inject and tool inject. So you could for example prompt inject to get a model to call your tool which then does an injection to get the user to run some untrustworthy code of your own devising.

[1] See the excellent series by Simon Willison on this https://simonwillison.net/series/prompt-injection/


Yeah, you aren't far off with SQL injection comparison. That being said it's not really a fault of the MCP spec, more so with current client implementations of it.


I was in the same boat in regards to trying to find the actual JSON that was going over the wire. I ended up using Charles to capture all the network requests. I haven't finished the post yet, but if you want to see the actual JSON I have all of the request and responses here https://www.catiemcp.com/blog/mcp-transport-layer/


itd be nice if you prettified your json in the blogpost

fwiw i thought the message structure was pretty clear on the docs https://modelcontextprotocol.io/docs/concepts/architecture#m...


Yeah, I plan on improving the formatting and adding a few more examples. There were even still some typos in the piece. To be honest, I didn't plan on sharing it yet; I just figured it might be helpful for the OP, so I shared it early.

I also think the docs are pretty good. There's just something about seeing the actual network requests that helps clarify things for me.


Some (many?) people learn better from concrete examples and generalize from them.


Not just that, but it's also useful to have examples to validate a) your understanding of the spec, and b) product's actual adherence to the spec.


Oh, that's really nice. Did you capture the responses from the LLM. Presumably it has some kind of special syntax in it to initiate a tool call, described in the prompt? Like TOOL_CALL<mcp=github,command=list> or something…


I had never heard of charles ... (https://www.charlesproxy.com/) I basically wrote a simple version of it 20 years ago (https://github.com/kristopolous/proxy) that I use because back then, this didn't exist ... I need to remember to toss my old tools aside


Well, Charles launched almost 20 years ago, so I'd say there's a good chance that it did exist.


Well hopefully my current thing, a streaming markdown renderer for the terminal (https://github.com/kristopolous/Streamdown) hasn't also been a waste of time


Why would anything be a waste of time?


I build things I cannot find.

Every project I do is an assertion that I don't believe the thing I make exists.

I have been unable to find a streaming forward only markdown renderer for the terminal nor have I been able to find any suitable library that I could build one with.

So I've taken on the ambitious effort of building my own parser and renderer and go through all the grueling testing that entails


the answers to that question are hugely variable and depend on the objective and defining waste. if one values learning intrinsically, like most of us here probably do, it is pretty hard to come up with a waste of time, even taking the rare break from learning.

But it seems self-evident where constraints like markets or material conditions might demarcate usefulness and waste.

Even the learners who are as happy to hear about linguistics as they are material science I presume do some opportunity cost analysis as they learn. Personally speaking, I rarely, if ever, feel like I'm wasting time per se but I always recognize and am conscious of the other things I could be doing to better maximize alternative objectives. That omnipresent consciousness may just be anxiety though I guess...


either that or "waste of time" is a meaningless phrase


Yeah, at its core it's just a proxy, so there are a lot of other tools out there that would do the job. It does have a nice UI and I try to support projects like it when I can.

I'll check out your proxy as well, I enjoy looking at anything built around networking.


even the approach that charles takes for intercepting TLS traffic is a bit old school (proxies, fake root certs etc.) - cool kids use eBPF https://mitmproxy.org/posts/local-capture/linux/


I can see how you don't need a proxy any more, but I don't see how you can bypass TLS without fake root certs, even with eBPF.


new to this program as well but looks really nice.

i think it is still a proxy though unless I’m missing something (beyond the name lol).

[here's a section on macos dealing with certs](https://mitmproxy.org/posts/local-capture/macos/)


here is one example: https://github.com/gojue/ecapture

in short, you can hook calls within SSL libraries (like OpenSSL)


Sure, but that very much depends on the application, no? What if it's statically linked its SSL lib?


you wanted to know how people are bypassing the need for a certificate with eBPF, that is how


super cool. Debugging locally across multiple microservices can be a huge pain.

A small feature request/idea - packaging this as a helm chart to make local development on something like Minkikube easier.


Oh yeah that's a great idea - just set it up once in minikube and use it just like you were debugging something in your production k8s stack. As much as I love k9s, sometimes I just need a trace to debug something between services


Congrats on the 1.0 Release, big milestone.

I'm personally really excited about all of the recent tooling for postgres aggregates. Definitely a pain point for a lot of developers and its easy to fall in trap where things work fine in the beginning and then query times explode as requirements change and the dataset grows. Nice to not have to spin up another DB in order to solve the problem as well.


> I'm personally really excited about all of the recent tooling for postgres aggregates. Definitely a pain point for a lot of developers

Could you give a few examples of what you are speaking of?


Congrats on the launch!

For those who have not experimented with columnar based databases, I would highly recommend toying around with them.

The performance improvements can be substantial. Obviously there are drawbacks involved with integrating a new database into your infrastructure, so it is exciting to see columnar format introduced to Postgres. Removes the hurdle of learning, deploying and monitoring another database.


Sapiens: A Brief History of Humankind by Yuval Noah Harari


I see quite a few people reading this on my commute. How do you find it so far?


Thanks for the advice, that's actually what I'm trying to do now. I'm offering to build free mobile websites so that I can build up a portfolio for myself. Just out of curiosity how many examples of work would you be interested in seeing before you felt comfortable purchasing.


As someone who is learning python as their first programming language this was very helpful.


Thanks!


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: