Code-executing agents in production
Tool-calling agents often pass data and intermediate results through the model at every step, which increases token usage and slows down complex workflows.
An alternative approach is to let the model generate code that runs inside a controlled execution environment. Computation stays close to the data, and the model receives only concise results. In Anthropic's example using the Model Context Protocol (MCP), token usage on an agentic task dropped from roughly 150,000 tokens to about 2,000, a reduction of about 98.7%.
In this post, we examine what it takes to operate this code-execution pattern in production and describe an implementation based on Python sandboxes, filesystem-mounted MCP wrappers, and Ray for distributed execution.
Rethinking tool use
When several tool definitions are placed directly in the model's context and every intermediate result is returned through the model, token usage grows quickly over the course of a workflow.
Code-execution agents change this pattern. Instead of relying on definitions embedded in the prompt, the agent explores a mounted filesystem of MCP client wrappers, imports what it needs, and runs code directly inside the sandbox. Tooling becomes part of the execution environment rather than part of the context window.
This shifts the agent's role from issuing individual tool calls to composing small, purpose-built programs.
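For instance, rather than several separate tool calls whose outputs all flow through the model, the agent can write one short program. In the sketch below, the wrapper module path and column names are assumptions; only the printed summary travels back to the model.
# Illustrative sketch: module path and column names are assumed.
from servers.datasets.load import load_dataset

df = load_dataset("sales.csv")                        # data stays in the sandbox
top = df[df["revenue"] > 1000].nlargest(5, "revenue")
print(top[["region", "revenue"]].to_string())         # only this summary reaches the model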
The production gap
Running model-generated code in a single-user prototype is straightforward. Running it reliably in production, across many users and workloads, introduces several requirements.
The execution environment must isolate untrusted code. It needs consistent filesystem and network boundaries. Interpreter state must persist across turns. MCP client wrappers must be discoverable at runtime. And when generated code triggers heavier computation, the underlying system must route that work to appropriate resources.
Anthropic's post highlights the advantages of the approach. Real deployments require the infrastructure beneath it.
A practical architecture for MCP-style code execution
1. A secure, stateful execution environment
The sandbox executes model-generated code in an isolated environment. gVisor provides a secure isolation boundary between sandboxed code and the host, while avoiding the overhead of full virtual machines. Interpreter state persists across turns, allowing incremental workflows in which data is loaded once and reused.
Example:
# Load a dataset via MCP inside the sandbox
from servers.datasets.load import load_dataset

df = load_dataset("sales.csv")
df.describe()
On a later turn:
# Reuse the in-memory DataFrame
df[df["revenue"] > 1000].head()
The data remains in memory within the sandbox and does not pass back through the model.
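One way to picture this persistence is a per-session namespace that each turn's code executes into. The sketch below is a minimal illustration of the idea, not the gVisor-backed implementation:
# Minimal sketch of per-session interpreter state (illustration only).
sessions = {}

def run_turn(session_id, code):
    ns = sessions.setdefault(session_id, {})
    exec(code, ns)                    # names defined in one turn remain for the next

run_turn("s1", "x = 40")
run_turn("s1", "print(x + 2)")        # prints 42; x survived the first turn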
2. MCP servers exposed as importable modules
MCP servers run externally as independent services. Their client wrappers are mounted into the sandbox as Python modules arranged in a structured directory:
/mnt/servers/
├── datasets/
│ ├── list.py
│ └── load.py
└── analytics/
└── summarize.py
Agents can inspect the filesystem:
# Discover which MCP wrapper modules are mounted
import os
os.listdir("/mnt/servers/datasets")
Imported modules serve as the interface between sandboxed code and MCP servers. The agent discovers capabilities by exploring the filesystem rather than receiving all tool definitions in its prompt.
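A wrapper module itself can be thin. The sketch below shows what a file such as load.py might contain, using the MCP Python SDK over SSE; the server URL, tool name, and CSV response format are assumptions rather than the reference implementation:
# servers/datasets/load.py -- hypothetical wrapper; endpoint, tool name, and
# response format are assumptions.
import asyncio
import io
import pandas as pd
from mcp import ClientSession
from mcp.client.sse import sse_client

DATASETS_MCP_URL = "http://datasets-mcp:8000/sse"     # assumed service endpoint

async def _call_load(name: str) -> str:
    async with sse_client(DATASETS_MCP_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("load_dataset", arguments={"name": name})
            return result.content[0].text             # assume CSV text is returned

def load_dataset(name: str) -> pd.DataFrame:
    """Fetch a dataset from the external MCP server and return it as a DataFrame."""
    return pd.read_csv(io.StringIO(asyncio.run(_call_load(name))))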
3. A runtime capable of routing resource-intensive tasks
When generated code invokes heavier computation, the system needs a way to route tasks to nodes with appropriate resources, such as CPU-optimized or GPU-backed workers. Distributed runtimes such as Ray handle this placement and scaling based on the resources each task declares.
The agent generates the code. The runtime determines where that code runs.
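As a rough sketch, with the resource request and data path as illustrative assumptions, the generated code declares what a task needs and Ray places it on a suitable node:
# Illustrative Ray sketch: resource request and data path are assumptions.
import ray

ray.init(address="auto")                      # connect to the existing cluster

@ray.remote(num_cpus=8)                       # request a CPU-heavy worker
def summarize_revenue(path: str) -> dict:
    import pandas as pd
    df = pd.read_csv(path)
    return df.groupby("region")["revenue"].sum().to_dict()

# The sandboxed agent code receives only this small summary dict.
summary = ray.get(summarize_revenue.remote("/data/sales.csv"))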
Putting the components together
A typical workflow proceeds as follows:
- The agent examines /mnt/servers to understand available MCP tools.
- It generates code to import specific modules.
- The sandbox executes that code while maintaining state across turns.
- Resource-intensive operations are routed to appropriate compute nodes.
- Only summarized results are returned to the model.
This produces an efficient loop of code generation, execution, and result summarization, with minimal token usage.
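A compressed sketch of that loop, with generate_code and run_in_sandbox as hypothetical stand-ins for the model call and the sandbox API, might look like this:
# Hedged sketch of the loop above; generate_code and run_in_sandbox are
# hypothetical stand-ins for the model call and the sandbox API.
from typing import Callable

def agent_turn(generate_code: Callable[[str], str],
               run_in_sandbox: Callable[[str], str],
               user_request: str) -> str:
    # 1. Let the model see what is mounted under /mnt/servers.
    listing = run_in_sandbox("import os; print(os.listdir('/mnt/servers'))")
    # 2. Ask for code that imports the relevant wrappers and prints a summary.
    prompt = (f"Available MCP modules: {listing}\n"
              f"Task: {user_request}\n"
              "Write Python that answers the task and prints only a short summary.")
    code = generate_code(prompt)
    # 3. Execute in the stateful sandbox; only the printed summary returns to the model.
    return run_in_sandbox(code)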
A Python implementation of this pattern
While Anthropic's examples use TypeScript, we have implemented this pattern in Python, which is well suited to analytical and data-intensive workloads.
Inside our environment, agents write Python that interacts with MCP via mounted modules:
from servers.datasets.load import load_dataset
df = load_dataset("sales.csv")
df.describe()
In later turns, state is preserved:
df[df["revenue"] > 1000]
Our implementation combines:
- gVisor-isolated Python sandboxes
- Filesystem-mounted MCP wrappers
- Autoscaled MCP servers
- Resource-aware scheduling using Ray
- Adapters for frameworks such as LangGraph
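The framework adapters can be thin as well. The sketch below wires sandbox execution into a LangGraph node; run_in_sandbox is a local stand-in for the real sandbox client, which would execute the code in a gVisor-isolated session:
# Hedged LangGraph sketch; run_in_sandbox is a local stand-in for the real
# sandbox client.
import contextlib
import io
from typing import TypedDict
from langgraph.graph import StateGraph, END

def run_in_sandbox(code: str) -> str:
    # Stand-in only: the real adapter would send code to the remote sandbox.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

class AgentState(TypedDict):
    code: str
    result: str

def execute_node(state: AgentState) -> dict:
    return {"result": run_in_sandbox(state["code"])}

graph = StateGraph(AgentState)
graph.add_node("execute", execute_node)
graph.set_entry_point("execute")
graph.add_edge("execute", END)
app = graph.compile()

print(app.invoke({"code": "print(2 + 2)", "result": ""})["result"])   # -> 4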
This architecture makes MCP-style code execution practical in multi-tenant production environments.
Example implementation
We have open-sourced a reference example that illustrates this pattern end to end.
The example demonstrates:
- Sandboxed Python code execution
- Persistent session state
- MCP tool discovery via filesystem
- Distributed scheduling using Ray
Running this in production
For teams looking to deploy code-executing agents without building the infrastructure from scratch, RayAI provides managed sandboxes with built-in MCP support, persistent state, and Ray-based orchestration. It handles the security, isolation, and scaling concerns described above.
If you're exploring this pattern or have questions about production deployment, reach out to us at hello@rayai.com.