TANAY.SHAH
// PUBLISHED 2026-05-10 · 8 MIN READ

Bubblewrap, Landlock, gVisor, Firecracker: Choosing a Sandbox for AI Agent Code Execution in 2026

Anthropic uses bubblewrap for Claude Code and gVisor for Claude web. OpenAI Codex defaults to Landlock. Vercel and E2B use Firecracker microVMs. The four leading sandbox options are not interchangeable, and 'pick the strongest one' is the wrong heuristic. Here's the two-axis decision model I use, the use cases each option is right for, and why I picked bubblewrap + seccomp + AST validation for the in-house Python execution layer instead of reaching for a microVM.

If you've spent any time reading 2026 sandbox guides for AI agents you've seen the same four names: bubblewrap, Landlock, gVisor, Firecracker. They're often presented as a strength hierarchy, with microVMs at the top and namespace-based sandboxes at the bottom. That ordering is correct for blast-radius isolation. It is incomplete as a decision rule. Different production systems pick different points on the curve for defensible reasons. This post is about how I think through that decision when I'm shipping a Python execution layer for an agent.

Who picks what in 2026 (and why)

  • Anthropic Claude Code (Linux): bubblewrap + seccomp. The agent runs locally on a developer machine, the threat model is 'agent goes off the rails inside the user's own session,' and the user already trusts the developer environment.
  • Anthropic Claude (web): gVisor. The agent runs in Anthropic's multi-tenant cloud, where the threat model is 'one tenant's prompt influences another tenant's execution.' Stronger isolation than a namespace-based sandbox, lighter than a full VM.
  • OpenAI Codex (Linux): Landlock + seccomp by default. Newer LSM API, simpler ruleset declaration, kernel-level enforcement. Same threat model as Claude Code: local developer environment, trusted host.
  • Vercel Sandbox, E2B, Fly.io Sprites: Firecracker microVMs. The threat model is 'we don't know whose code we're running, this is multi-tenant SaaS, full kernel separation is the only acceptable answer.' Boot times: 90-200 ms across vendors. Modal Sandbox is the exception among hosted products: it uses gVisor instead and starts in under a second.
  • OpenShell (NVIDIA, Cisco DefenseClaw, Pipelock): Landlock + seccomp + network namespaces. The 2026 wave of policy-driven agent sandboxes converges on the same primitives because Landlock is the kernel's own answer.

Same four primitives, four different points on the same curve. The pattern that emerges: the sandbox tier you pick is determined by who you're protecting against, not by the agent's capabilities.

The two axes that decide it

The decision is not 'how isolated is each option.' It's a function of two things:

  • Blast radius: if the sandbox is breached, what does the attacker reach? On a developer's local machine: the same files the developer can already touch. On a multi-tenant cloud: other customers' data. The radius determines the minimum acceptable strength.
  • Operational cost: boot time, memory overhead, syscall overhead, deploy complexity. A Firecracker microVM contains a breach best and costs the most to operate. A Landlock ruleset or a bubblewrap namespace costs the least to operate and contains a breach worst, because every sandboxed process still shares the host kernel. The right answer is the cheapest option that covers your blast radius, not the strongest option you can afford.
                        STRONGER isolation
                                 ▲
                                 │
                          ┌──────┴──────┐
                          │  Firecracker│  multi-tenant SaaS, untrusted code,
                          │  microVM    │  full kernel separation needed
                          │  ~125ms boot│
                          └─────────────┘
                                 │
                          ┌──────┴──────┐
                          │   gVisor    │  multi-tenant or shared infra,
                          │  Sentry     │  syscall interception is enough,
                          │  5-15% over │  CPU-heavy is fine
                          └─────────────┘
                                 │
                          ┌──────┴──────┐
                          │  Landlock + │  trusted host, untrusted code,
                          │  seccomp    │  modern Linux only (>=5.13)
                          │  near-zero  │  simple, declarative rulesets
                          └─────────────┘
                                 │
                          ┌──────┴──────┐
                          │  Bubblewrap │  trusted host, controlled agent,
                          │  + seccomp  │  fast boot, mature tooling,
                          │  + AST      │  most-deployed primitive
                          └─────────────┘
                                 │
                                 ▼
                        LIGHTER ops cost

Bubblewrap: lightweight, controlled-host

bubblewrap is a Linux user-namespace sandbox: no daemon, no image layers, no extra runtime to install. It's already packaged in most Linux distros and was the sandboxing layer under Flatpak long before it became the default Claude Code sandbox on Linux. Boot is essentially the cost of fork+exec plus the namespace setup, single-digit milliseconds in practice. Memory overhead is negligible.

Where it shines: developer-tools agents that run on the user's own machine, in-house code-execution layers where you control the agent surface, ZDR-compliant Python sandboxes where the threat is the agent generating wrong Python, not the agent generating malicious Python that escapes to the host. The April 2026 Claude Code escape (documented at length) was a denylist-vs-allowlist configuration error, not a bubblewrap weakness; the same configuration mistake on Landlock or gVisor produces the same outcome.

Where it doesn't: multi-tenant cloud where the tenants don't trust each other. The Linux kernel between bubblewrap'd processes is shared, and a kernel-level vulnerability (a 0-day in a syscall the agent can reach) breaks isolation between tenants. For that threat model you want a different kernel per tenant.
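
To make the allowlist idea concrete, here is a minimal sketch of how a supervisor might assemble a bwrap command line. BwrapSpec and build_bwrap_argv are hypothetical names of mine, not part of bubblewrap or of any vendor's sandbox; the flags themselves are standard bwrap options. The point is that the child sees only the paths you list, rather than everything minus a denylist.

```python
from dataclasses import dataclass, field


@dataclass
class BwrapSpec:
    """Hypothetical spec: any host path not listed here is invisible to the child."""
    ro_binds: list[str] = field(default_factory=list)  # read-only host paths
    rw_binds: list[str] = field(default_factory=list)  # writable host paths
    allow_network: bool = False


def build_bwrap_argv(spec: BwrapSpec, command: list[str]) -> list[str]:
    """Build the argv for running `command` under bwrap with an allowlist."""
    argv = [
        "bwrap",
        "--unshare-all",      # fresh user, pid, net, ipc, uts, cgroup namespaces
        "--die-with-parent",  # child is killed if the supervisor exits
        "--new-session",      # detach from the controlling terminal
        "--clearenv",         # start from an empty environment
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
    ]
    if spec.allow_network:
        argv.append("--share-net")  # opt back in to the host network namespace
    for path in spec.ro_binds:
        argv += ["--ro-bind", path, path]
    for path in spec.rw_binds:
        argv += ["--bind", path, path]
    # bwrap treats the first non-option argument as the start of the command.
    return argv + list(command)
```

The result can be handed to subprocess.run. Which paths belong in ro_binds depends on the distro layout and on where the interpreter lives, so treat any concrete list as an assumption to verify on your hosts.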

Landlock: cleaner API for the same niche

Landlock is the Linux kernel's own answer to the 'unprivileged process restricting itself' problem. It's a stackable LSM (Linux Security Module) that lets a process declare a ruleset of allowed paths and operations and then irreversibly apply it to itself. Kernel 5.13+ for the basics, more capabilities in later versions, network restrictions in 6.7+.

OpenAI Codex picked it as the Linux default. The pitch: the API is more declarative than bubblewrap's command-line flags, the enforcement is in the kernel rather than in a separate process, and the same primitives compose with seccomp the way they always have. The drawback in practice: kernel version requirements limit deployment (5.13 is fine for fresh-ish hosts, breaks on older long-term-support distros), and the docs / tooling ecosystem is younger than bubblewrap's. The reference tool (island) explicitly says 'not yet ready for production.'

I'd choose Landlock today on a greenfield project targeting modern Linux. I'd choose bubblewrap on anything that has to ship to a heterogeneous fleet today.
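
The kernel-version constraint can be checked at startup instead of guessed. The sketch below asks the kernel for its Landlock ABI version through the raw landlock_create_ruleset syscall (number 444 on x86_64 and arm64), then falls back to bubblewrap when the host is too old. The ABI thresholds follow the versions above: ABI 1 arrived in 5.13 and TCP network rules arrived with ABI 4 in 6.7. pick_backend is a hypothetical helper of mine, not part of any library.

```python
import ctypes
import sys

SYS_LANDLOCK_CREATE_RULESET = 444    # same number on x86_64 and arm64
LANDLOCK_CREATE_RULESET_VERSION = 1  # flag: return the ABI version, create nothing


def landlock_abi_version() -> int:
    """Return the kernel's Landlock ABI version, or 0 if Landlock is unavailable."""
    if not sys.platform.startswith("linux"):
        return 0
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # With the VERSION flag the attr pointer must be NULL and the size 0.
    result = libc.syscall(
        ctypes.c_long(SYS_LANDLOCK_CREATE_RULESET),
        ctypes.c_void_p(None),
        ctypes.c_size_t(0),
        ctypes.c_uint(LANDLOCK_CREATE_RULESET_VERSION),
    )
    # A negative result means ENOSYS/EOPNOTSUPP (old kernel or Landlock disabled).
    return result if result > 0 else 0


def pick_backend(abi: int, need_network_rules: bool) -> str:
    """Hypothetical policy: use Landlock only when the host supports what we need."""
    if abi < 1:
        return "bubblewrap"  # pre-5.13 kernel, or Landlock not enabled
    if need_network_rules and abi < 4:
        return "bubblewrap"  # TCP bind/connect rules need ABI 4 (6.7+)
    return "landlock"
```

Calling pick_backend(landlock_abi_version(), need_network_rules=True) on each host is one way to run a mixed fleet without hard-coding kernel versions.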

gVisor: middle ground for shared infra

gVisor's pitch is conceptually distinct from the namespace-based options. Instead of restricting what an unprivileged process can do, it intercepts every syscall the application makes and reimplements it inside a Go-based user-space kernel called the Sentry. The application doesn't talk to the host kernel at all in the normal path. This buys real isolation against kernel exploits at the cost of about 5-15% syscall overhead.

The right zone for gVisor is shared multi-tenant infrastructure where you're not willing to pay full microVM cost but you want better-than-namespace isolation. Anthropic uses it for Claude web. Modal uses it for its sandbox runtime. On Google Cloud, App Engine standard, Cloud Functions, and Cloud Run's first-generation execution environment run on gVisor. The performance profile is workload-dependent: CPU-bound work (machine learning, data processing) is barely affected; I/O-heavy work (databases, anything that leans hard on the filesystem and network) is meaningfully slower.

I'd not choose gVisor for a single-tenant in-house agent runtime; the overhead is paying for isolation against a threat I don't have. I'd absolutely choose it for a multi-tenant SaaS where each customer's prompts trigger code execution and the customers don't know each other.

Firecracker: the microVM hammer

Firecracker boots a real Linux VM in roughly 125 ms with about 5 MB of memory overhead. This is the strongest isolation in production use today. Each tenant gets their own kernel, their own filesystem, their own networking stack. A kernel exploit in tenant A's VM doesn't reach tenant B because there is no shared kernel.

AWS Lambda, Fargate, Fly.io machines, Vercel Sandbox, E2B, Sprites, Daytona all run on Firecracker. The boot-time numbers across vendors are in a tight band (90-200 ms cold), and the practical bottleneck for further speedup is memory copy of the snapshot, not Firecracker itself. The cost: more deploy complexity (KVM-capable hosts, image management, snapshotting), and the per-tenant memory overhead, which adds up at high concurrency.

If you're building a hosted code-execution product in the E2B or Vercel Sandbox mold, Firecracker is the right answer because the threat model demands it. If you're building an agent that runs trusted Python in your own infrastructure for your own users, Firecracker is overkill, and the operational tax pays for protection against a threat that isn't in your blast radius.

The decision tree I use

  • Step 1: who could the attacker hurt? If 'only this user themselves' (developer-machine agent): bubblewrap or Landlock is enough.
  • Step 2: do you control the agent's tool surface and prompts? If yes (in-house agent), AST validation + bubblewrap + seccomp covers ~99% of realistic threats. If no (third-party prompts in shared infra): step 3.
  • Step 3: do tenants trust each other? If yes (single-customer multi-team product): gVisor. If no (true SaaS): Firecracker.
  • Step 4: if steps 1-2 put you in the lightweight tier, do you have to support older Linux kernels (pre-5.13)? If yes: bubblewrap. On greenfield modern Linux, Landlock has the cleaner API.
  • Step 5: budget for ops complexity. Each tier up costs setup, observability, and a non-trivial fraction of your platform team's attention. Don't pay it unless step 1 told you to.
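
The steps above can be sketched as a small function. This is my reading of the tree, not a standard: Deployment and choose_sandbox are hypothetical names, the returned labels are just strings, and step 5 (ops budget) stays a human judgment that the code does not try to encode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    attacker_reaches_only_this_user: bool  # step 1
    you_control_tools_and_prompts: bool    # step 2
    tenants_trust_each_other: bool         # step 3
    must_support_pre_5_13_kernels: bool    # step 4


def choose_sandbox(d: Deployment) -> str:
    """Walk steps 1-4 and return a label for the cheapest tier that fits."""
    lightweight = d.attacker_reaches_only_this_user or d.you_control_tools_and_prompts
    if lightweight:
        # Step 4 only matters inside the lightweight tier.
        primitive = "bubblewrap" if d.must_support_pre_5_13_kernels else "landlock"
        # Step 2: an in-house agent also gets AST validation in front.
        extras = "+seccomp+ast" if d.you_control_tools_and_prompts else "+seccomp"
        return primitive + extras
    # Step 3: third-party prompts on shared infrastructure.
    return "gvisor" if d.tenants_trust_each_other else "firecracker"
```

For example, an in-house agent on a fleet with old kernels comes out as bubblewrap+seccomp+ast, and a true SaaS with mutually untrusted tenants comes out as firecracker.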

Why I picked bubblewrap + seccomp + AST for the in-house Python execution layer

The agent I shipped runs Python that the LLM generates, on infrastructure I control, for a small set of internal-and-trusted users. The threat model is 'the model generates wrong Python that touches files it shouldn't' or 'the model gets prompt-injected into trying to read /etc/passwd.' It is not 'an external customer is trying to break out of the VM.' For that threat model, the cheapest, simplest design that covers the actual blast radius has three layers: bubblewrap with a strict mount-bind allowlist, seccomp restricting the syscall surface to the minimum the legitimate code needs, and AST validation of the generated Python before it ever runs.
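
For a sense of what the AST layer looks like, here is a minimal sketch using Python's standard ast module. The allowlist and banned-call sets are illustrative assumptions, not the policy I actually run, and validate_snippet is a hypothetical name. A static check like this is a cheap first filter for obviously wrong code; it is easy to evade on its own, which is why bubblewrap and seccomp stay underneath it as the enforcement layers.

```python
import ast

# Illustrative policy only; a real deployment would tune both sets.
ALLOWED_IMPORTS = {"math", "json", "statistics"}
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}


def validate_snippet(source: str) -> list[str]:
    """Return a list of policy violations; an empty list means the snippet may run."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems: list[str] = []
    for node in ast.walk(tree):
        # Collect imported module names from both import forms.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"line {node.lineno}: import of {name!r} not allowed")

        # Direct calls to banned builtins, e.g. open('/etc/passwd').
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            problems.append(f"line {node.lineno}: call to {node.func.id}() not allowed")

        # Dunder attribute access is the usual route to introspection tricks.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"line {node.lineno}: attribute {node.attr!r} not allowed")
    return problems
```

A snippet that only imports math passes with an empty list, while one that calls open() on /etc/passwd is rejected before any sandbox is started.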

Could I have used Firecracker? Yes. The cost would have been a per-execution VM boot (vs. fork+exec), plus image management, plus a much larger ops surface, in exchange for protection against threats not in my blast radius. The operational complexity I'd be paying for would be hardening against a hypothetical kernel 0-day exploitable by the model, while the actual frequent failure mode is the model writing slightly-wrong Python.

If the same agent ever ships to external multi-tenant SaaS, the sandbox tier moves up: gVisor for medium-trust shared infra, Firecracker for true SaaS. The architecture change is an interface boundary; the agent code itself doesn't have to change. Keeping the abstraction at 'execute_code(snippet)' over a swappable runtime is the design choice that makes the future migration cheap.
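
Here is one way that boundary could look, as a sketch. The execute_code name comes from the paragraph above; SandboxRuntime, ExecutionResult, and WrappedSubprocessRuntime are hypothetical names of mine. The wrapper argument is where a bwrap argv prefix would go. With an empty wrapper the snippet runs with no isolation at all, which is only appropriate for tests.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ExecutionResult:
    exit_code: int
    stdout: str
    stderr: str


class SandboxRuntime(Protocol):
    """Anything that can run a snippet; bwrap, gVisor, or microVM backends fit here."""
    def run(self, snippet: str, timeout_s: float) -> ExecutionResult: ...


class WrappedSubprocessRuntime:
    """Runs the snippet in a child interpreter, optionally behind a wrapper command."""

    def __init__(self, wrapper: Sequence[str] = ()):
        # e.g. a bwrap argv prefix; empty means NO sandbox (tests only).
        self.wrapper = list(wrapper)

    def run(self, snippet: str, timeout_s: float) -> ExecutionResult:
        # Assumes the interpreter path is visible inside the wrapper's filesystem view.
        # subprocess.TimeoutExpired propagates to the caller on timeout.
        proc = subprocess.run(
            [*self.wrapper, sys.executable, "-I", "-c", snippet],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return ExecutionResult(proc.returncode, proc.stdout, proc.stderr)


def execute_code(snippet: str, runtime: SandboxRuntime,
                 timeout_s: float = 5.0) -> ExecutionResult:
    """The only entry point the agent code sees."""
    return runtime.run(snippet, timeout_s)
```

Moving up a tier then means writing another class with the same run method; callers of execute_code stay untouched.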

What I'd change if I were to redesign

  • Make the runtime swappable from day one. Today the layer is bwrap-specific. Adding a Landlock backend behind the same interface would let me migrate to it on modern hosts without touching the agent code.
  • Layer the AST validator earlier. The current pass runs as a step inside the sandbox process. Lifting it before the sandbox boot would catch obviously bad code without paying the namespace setup cost.
  • Add network policy as a first-class config field in the sandbox spec. Today network is binary (off / on, with the on case routing through a proxy). The right level of granularity is one allow rule per destination (allow github.com/* and pypi.org/* but nothing else), and that field deserves to be declarative, not procedural.
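
As a sketch of what the declarative network field could look like: NetworkPolicy and permits are hypothetical names, and I'm assuming the egress proxy would be the component consulting it. Rules are host-plus-path globs in the same shape as the github.com/* example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


@dataclass(frozen=True)
class NetworkPolicy:
    """Declarative allowlist; an empty tuple means the network is off."""
    allow: tuple[str, ...] = ()

    def permits(self, url: str) -> bool:
        parts = urlsplit(url)
        host = parts.hostname or ""  # urlsplit lowercases the hostname
        if not host:
            return False
        # Match against "host/path" so a rule must anchor at the real hostname.
        target = host + (parts.path or "/")
        return any(fnmatchcase(target, rule) for rule in self.allow)
```

With allow=("github.com/*", "pypi.org/*"), a request to pypi.org passes, while lookalikes such as github.com.evil.example are denied because the rule has to match from the start of the hostname.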

The bigger lesson

The instinct to pick the strongest option (Firecracker, microVMs, full kernel separation) is the same instinct that pushes teams toward 'we should rewrite this in Rust' or 'we need Kubernetes for this.' Sometimes the answer is yes. More often the answer is: the threat that justifies the strongest option isn't in your blast radius, and the simpler option does the job for less attention. Sandboxing is a discipline of matching the tool to the threat, not picking the tool with the strongest resume.

If a hiring manager asks me how I think about agent security, this is the framing. Not 'we use Firecracker because it's the strongest,' but 'here's the threat model, here's the option that covers it for the lowest ops cost, here's the swap path if the threat model evolves.' That's what production-grade sandbox decision-making looks like in 2026.

References

  • wincent/coding-agent-sandboxes-2026-05 (GitHub gist): exhaustive list of who uses what
  • michaellivs.com: "Why Anthropic and Vercel chose different sandboxes" (2026)
  • gVisor official performance and security guides (gvisor.dev/docs)
  • Firecracker project site (firecracker-microvm.github.io) and AWS docs
  • Landlock LSM kernel documentation (docs.kernel.org/security/landlock)
  • Northflank: 'How to sandbox AI agents in 2026: MicroVMs, gVisor & isolation strategies'
  • Anthropic's Bubblewrap escape post-mortem (April 2026, Ona / Leonardo Di Donato)