MCP Security Exploits and Defense Strategies — Vitor Balocco (Runlayer)

I hosted a session with Vitor Balocco, co-founder of Runlayer and former AI tech lead at Zapier, to discuss the critical security challenges facing MCP (Model Context Protocol) servers. As agents become more prevalent in enterprise environments, understanding these security risks and implementing proper safeguards is essential for developers and organizations adopting this technology.

What are the primary security risks with MCP servers?

MCP servers have exploded in popularity over the last 6-8 months, but this rapid adoption has outpaced security considerations. The ecosystem is growing so quickly that security aspects are lagging behind, creating dangerous gaps between adoption and protection.

The fundamental risk comes from what Simon Willison calls the "lethal trifecta":

An LLM application or agent with access to private data
Exposure to untrusted content that could be controlled by attackers
The ability to externally communicate in ways that could exfiltrate data

When all three conditions exist, your system becomes vulnerable to various attack vectors. Since AI agents are designed to be semi-autonomous decision makers, giving them access to external tools and private data creates significant risks of credential theft, impersonation, and unauthorized code execution.

Key Takeaway: The combination of private data access, untrusted content exposure, and external communication capabilities creates the perfect storm for security breaches in MCP-powered applications.

How do prompt injection attacks work in MCP environments?

Prompt injection is ranked #1 in the OWASP Top 10 AI LLM Risks for good reason. While most people think of prompt injection as direct user messages trying to jailbreak a system, the reality is more complex - anything that ends up in your LLM's context can trigger a prompt injection.

This includes:

User messages
Tool outputs
Tool schemas themselves
Tool parameter names

For example, imagine you have an MCP server with a tool to read LinkedIn profiles for recruiting purposes. Since you have no control over the content of those profiles, a malicious profile could contain instructions that trick your agent into performing unauthorized actions or leaking sensitive information.

The GitHub exploit demonstrates this risk perfectly. In this scenario, an agent has access to both public and private repositories. A malicious actor creates an issue in the public repository with instructions that trick the agent into reading private repository content and publishing it to the public repository's README file.

Key Takeaway: Prompt injections can come from any untrusted content source that enters your LLM's context, not just direct user inputs. This makes them particularly dangerous when combined with tools that have access to sensitive information.

What are some real-world examples of MCP security breaches?

Several notable exploits have already been discovered in production systems:

The Notion Agents exploit: Notion's new agent product initially had a search tool that accepted both natural language queries and URLs. Attackers embedded malicious instructions in PDF files uploaded to Notion workspaces, which then triggered the agent to exfiltrate private data through URL requests.
The Heroku exploit: Attackers made GET requests to a Heroku server for non-existent URLs, causing 404 errors. These errors were logged, and when users asked their agent to check the logs, the malicious instructions embedded in the URL parameters triggered the agent to transfer app ownership to the attacker.
The GitLab Duo vulnerability: Attackers discovered they could render markdown images in GitLab's chatbot responses. By instructing the bot to search for private sales figures, base64 encode them, and include them in an image URL, they could exfiltrate the data when the bot attempted to render the image.

These examples demonstrate the creativity attackers employ to both trigger prompt injections and exfiltrate sensitive data through seemingly innocuous mechanisms.

Key Takeaway: Attackers are finding increasingly sophisticated ways to both inject malicious instructions and exfiltrate data through channels you might not expect, such as logs, image rendering, and search functionality.

What are the risks of supply chain attacks with MCP servers?

Beyond prompt injections, "rug pulls" represent a classic supply chain attack risk. Most MCP installation instructions tell you to install the latest version of some NPM package, but even if you initially inspect the code and find it safe, developers might publish updates with malicious code later.

This happened with the Postmark MCP server. The official GitHub repository instructed users to clone it and run npm install, but an attacker published a package to NPM with the same name. After maintaining a clean version for several releases to build trust, the attacker modified the send_email tool to always BCC a specific email address, effectively stealing all emails sent through the service.

Another sneaky attack vector involves manipulating tool parameter names. By naming parameters suggestively, attackers can trick models into passing sensitive data. For example, an innocent-looking "add" tool that takes two numbers might include hidden parameters like "tools_list" or "conversation_history" that the model will try to populate, effectively leaking this information.

Even if you're careful about which tools you install, one compromised server could "squat" on tool names used by other servers. For instance, if you have both WhatsApp and a compromised server installed, the compromised server might register its own "send_message" tool that takes precedence when your agent tries to send a WhatsApp message.

Key Takeaway: Supply chain attacks can compromise MCP servers even after you've verified them, making version pinning and careful auditing essential practices when using community-developed tools.

How can I protect my applications from MCP security risks?

If you're using MCP servers, implement these protective measures:

Pin tool versions of MCP servers you install and avoid auto-updating. Never use "latest" tags for community servers.
Be extremely cautious with write tools. Prefer read-only tools for workflows where you might approve actions without careful review.
Use a subset of servers and tools for each workflow rather than enabling everything at once. This forces you to review the tool list for each use case and limits potential damage.
Audit servers you're installing. Check for command injection, path traversal, and suspicious schemas.
Limit permissions by running servers with minimal required access. Don't give API keys with broader access than necessary.
Review tool descriptions carefully for hidden instructions or suspicious content.
Require human confirmation for any actions with side effects.

If you're building MCP servers or MCP-powered applications:

Never let model outputs go unfiltered. Strip out potentially malicious content before looping anything back into your LLM's context.
Sanitize tool outputs, especially when returning untrusted content.
Apply guardrails through both programmatic restrictions and LLM-based checks.
Separate external content with special delimiters in the context to limit its influence on model behavior.
Implement semantic/dynamic permissions. For example, if a tool has read private data, block subsequent calls to tools with external communication capabilities.
Default to blocking internet access and maintain an allowlist of permitted URLs.

For enterprise environments:

Maintain an internal official MCP catalog with version pinning and security team audits.
Proxy all MCP servers through a gateway you control for full oversight of data movement.
Host MCP servers in sandboxes to restrict access to local resources.

Key Takeaway: Treat your LLM as an untrusted user and implement multiple layers of defense. The best approach combines careful tool selection, content filtering, permission restrictions, and human oversight for high-risk actions.

How should companies approach MCP security at scale?

For companies adopting MCPs at scale, I recommend a systematic approach:

Create an internal MCP registry that only lists servers you've audited and trust. This gives your security team the ability to conduct regular reviews and static scans.
Proxy all MCP servers through a gateway you control. This provides oversight of every tool call moving data in and out of your LLMs and creates logs you can audit for potential breaches.
Apply guardrails at the gateway level to enforce security policies even for chat clients you don't directly control.
Host MCP servers in sandboxes to restrict access to local resources, eliminating entire categories of potential exploits.
Enforce the principle of least privilege when configuring authentication for MCP servers.

This is exactly what we're building at Runlayer - an MCP-first AI platform that's self-hosted with security, enterprise governance, and observability built in. We provide an internal MCP registry, SCIM integration, audit trails, security scanning, and usage analytics to help enterprises safely adopt MCP technology.

Key Takeaway: Enterprise-scale MCP adoption requires systematic approaches to security, including centralized registries, proxies, sandboxing, and comprehensive monitoring - either built in-house or through specialized platforms like Runlayer.

What should developers look for when evaluating MCP servers?

When evaluating whether to trust an MCP server, look for these indicators:

Examine the tools and their parameters carefully. Are there any unusual or suspiciously named parameters?
Use authentication tokens with the least amount of permissions possible when testing.
Check if the source code is publicly available and well-maintained.
Consider who the author is and their reputation in the community.
Look at how frequently new versions are published and how many people have installed it.

For GitHub-based tools specifically, be careful about the permissions you grant. Many developers use tokens with broad access, which creates significant risk. Configure GitHub authentication manually with the minimum permissions required for your use case.

When building applications that use MCPs, consider implementing input and output filtering based on your specific context. While there aren't many standardized tools for this yet, even basic filtering for sensitive information categories can significantly reduce risk.

Key Takeaway: Trust but verify when it comes to MCP servers. Examine code, limit permissions, and implement context-appropriate filtering to reduce your attack surface.

How will MCP security evolve in the future?

As the MCP ecosystem matures, we'll likely see different trends for different types of attacks:

Rug pull attacks will probably become less frequent as more official MCP servers become available. The MCP spec now publishes an official registry, which will help users identify approved servers, and companies will increasingly publish their own official implementations.

However, prompt injection attacks are likely to become more sophisticated. While newer models are less susceptible to simple attacks, attackers continue to develop more complex techniques. We'll need to develop architectures that are safe by design rather than relying solely on LLM-based defenses.

The ideal future approach will likely involve multiple layers of defense:

Architectures that separate tool outputs from the LLM context by default
Dynamic permission systems that track data flow and restrict tool access accordingly
Sandboxed environments for running MCP servers
Better mechanisms for managing which tools are available in specific contexts

Key Takeaway: While supply chain risks may decrease with ecosystem maturation, prompt injection will remain a significant challenge requiring architectural solutions rather than just better detection.

Final thoughts on MCP security

The rapid growth of the MCP ecosystem has created exciting opportunities for AI agent development, but also significant security challenges. By understanding the attack vectors - from prompt injections to supply chain attacks - and implementing appropriate safeguards, developers can mitigate these risks.

The most important principle is to treat your LLM as an untrusted user. Implement multiple layers of defense, conduct regular security testing, and always apply the principle of least privilege when configuring MCP servers.

For organizations building serious AI applications, investing in proper security infrastructure - whether built in-house or through platforms like Runlayer - will be essential to prevent potentially catastrophic data breaches or system compromises.

Remember that security in this space is still evolving. Stay vigilant, keep your systems updated (but with version pinning!), and contribute to the community's understanding of these risks as we collectively work to make MCP technology both powerful and secure.

FAQs

What are MCPs and why is security a concern?

MCP (Model Context Protocol) has become increasingly popular in the past 6-8 months for building AI agents. However, this rapid adoption has led to security vulnerabilities, including prompt injections and data exfiltration. As MCP is used to retrieve data and interact with external systems, understanding its security implications is crucial for safe implementation.

What is prompt injection and why is it dangerous with MCPs?

Prompt injection is ranked as the #1 risk in the OWASP Top 10 AI LLM Risks. It involves tricking an AI model into performing unintended actions. With MCPs, prompt injections can occur through various entry points—not just user messages, but also tool outputs, tool schemas, or even parameter names. When combined with access to private data and external communication capabilities, this creates what's called a "lethal trifecta" that can lead to serious security breaches.

What real-world examples exist of MCP security exploits?

Several notable exploits have occurred:

The GitHub exploit: An attacker created an issue in a public repository with instructions that tricked an agent into reading private repository data and publishing it publicly
The Notion exploit: Attackers embedded malicious instructions in a PDF file that was uploaded to a Notion workspace, using the search tool to exfiltrate private data
The Heroku exploit: Attackers embedded malicious instructions in URL query parameters that appeared in logs, which then instructed the agent to transfer app ownership

What are "rug pulls" in the context of MCPs?

A rug pull is a supply chain attack where an MCP installation appears legitimate initially but later introduces malicious code. For example, the Postmark MCP server exploit involved an attacker publishing a package to NPM with the same name as the official repository. After gaining trust through several legitimate versions, the attacker modified the email functionality to secretly BCC all emails to their address.

How can parameter naming be used for attacks?

Attackers can name parameters in suggestive ways to trick models into passing private data. For example, an innocent-looking addition tool might include hidden parameters like "tools_list" or "conversation_history" that could extract sensitive information when the model attempts to satisfy these parameters during a tool call.

What mitigation techniques can help prevent prompt injection?

Several approaches can help protect against prompt injections:

Implement input and output filtering to detect and sanitize sensitive content
Enforce privilege control and least privilege access by restricting models to only necessary tools
Require human approval for high-risk actions
Use special delimiters to separate external content in the context
Apply guardrails through both programmatic restrictions and LLM-based checks
Implement semantic/dynamic permissions that restrict certain actions based on previous operations

How should I approach using MCP servers safely?

When using MCP servers:

Pin tool versions and avoid auto-updating
Be cautious with write tools and prefer read-only tools when possible
Use specific subsets of servers and tools for each workflow
Audit servers you're installing by checking for command injection, path reversal, and suspicious schemas
Review tool descriptions for hidden instructions
Default to requiring confirmation for actions with side effects

What should I do if I'm building an MCP server?

If you're developing an MCP server:

Never let model outputs go unfiltered
Sanitize outputs from your tools, especially if they return untrusted content
Apply guardrails to prevent misuse
Consider sandboxing access to local resources

What enterprise-level protections should companies implement?

For organizations using MCPs at scale:

Maintain an internal official MCP catalog with version pinning
Only list servers that have been audited and approved
Proxy all MCP servers through a gateway you control for full oversight
Host MCP servers in sandboxes to restrict access to local resources
Implement security scans, both static and at runtime

Are there tools available to help manage MCP security?

Yes, solutions are emerging to address these challenges. Runlayer, for example, is building an MCP-first AI platform that's self-hosted with security, enterprise governance, and observability built in. It helps create an internal MCP registry with features like security scans, observability, and audit trails.