Docker Is Not a Sandbox

Why AI-Generated Code Turns Into an Infrastructure Problem

Docker is so useful that it keeps getting promoted into jobs it was never designed to do.

For packaging known applications, it is still the boring default. But when an AI system moves from generating text to executing arbitrary code, the problem stops being "how do I run this dependency tree?" and becomes "how do I contain this thing when it does something stupid?"

The primary engineering challenge is no longer inference. It is containment.

[Figure: The trust boundary moves. Text generation: LLM (layer 1) -> token stream (layer 2) -> consumer / human (layer 3). The model emits tokens. Output is inert until consumed.]

The assumption that broke

Traditional application infrastructure assumes the code has already been filtered.

The service was written by your team, reviewed, built in CI, pinned into an image, scanned badly by some tool whose reports nobody reads, and deployed through a pipeline that at least pretends to know what it is doing. The runtime is not trusted in the cryptographic sense, but it is known. You can profile it, capacity-plan around it, decide what network access it needs, and maybe write a seccomp profile if you hate yourself enough.

AI-generated code breaks that shape.

The code can be produced after the user request arrives. It can depend on prompt content, issue descriptions, README files, scraped pages, generated tests, and package docs. It may install dependencies at runtime, execute build scripts the operator has never seen, loop, fork, fill disk, exfiltrate through an allowed egress path, or accidentally delete the wrong directory because the model got confident.

None of that requires the model to be evil. "Hostile" is an operational classification, not a moral one. If nobody reviewed the program before execution, and the program can touch CPU, memory, disk, network, credentials, and tenant state, the infrastructure has to treat it like hostile multi-tenant compute.

That is why this space suddenly looks less like chatbot engineering and more like CI, serverless, edge workers, browser sandboxing, and online judge infrastructure had an annoying child together.

Why Docker became the default answer

Docker became the default answer because Docker is the default answer to almost every "run this code somewhere else" problem.

It gives you a filesystem image, process and network namespaces, cgroups, logs, environment variables, bind mounts, OCI tooling, registries, Compose, Kubernetes, and muscle memory. More importantly, it gives developers a normal Linux userspace. That matters because most generated code targets the same messy world humans target: Python, Node, Bash, apt, pip, git, ffmpeg, Playwright, SQLite, build tools, native libraries, and random transitive install scripts.

So the first AI code execution systems naturally reached for containers.

That was reasonable. If the workload is "run this trusted-ish repo's test suite for a coding agent", a container is the fastest way to get a reproducible environment. You can cap memory, set CPU shares, mount a workspace, throw the container away afterwards, and reuse existing ops machinery.

The mistake is treating that packaging boundary as a hard security boundary.

Docker optimized for packaging trusted applications. Wasm and microVMs optimize for safely executing untrusted workloads. That distinction is the entire reason these technologies are back in the conversation.

Containers isolate. They just do not isolate enough for this threat model.

The annoying part of "Docker is not a sandbox" is that it is both too simple and directionally correct.

Containers do provide isolation. Docker uses Linux namespaces and cgroups to separate process views and resource accounting, and Docker's own security docs describe namespaces as the first form of isolation between containers and the host. Docker also drops capabilities by default and can layer on AppArmor, SELinux, seccomp, user namespaces, rootless mode, read-only filesystems, no-new-privileges, and other controls. Its default seccomp profile is an allowlist that disables around 44 syscalls out of 300+ while preserving application compatibility.

That is useful. It is not a hard boundary.

A container is still a set of restrictions around processes that share the host kernel. Every syscall is a request into the kernel you are trying to protect. If the workload can trigger a kernel bug, abuse a badly namespaced feature, exploit an overbroad capability, or reach a privileged host interface like the Docker socket, the fact that the filesystem looked separate does not save you.

This is the shared-kernel problem. You are asking the host kernel to be both the common execution substrate and the isolation boundary for mutually suspicious tenants. That can be fine for internal services. It is a much uglier bet when the tenant is arbitrary generated code.

Hard boundary means something specific here: a boundary behind which I am comfortable colocating mutually untrusted tenants on the same physical machine, even when one tenant is actively trying to escape or degrade the rest of the node. Default containers are not where I would place that bet.

[Figure: Container vs microVM blast radius. Container: a hostile tenant and its neighbouring tenants sit on one Linux host kernel. Shared kernel. Escape touches every neighbour. microVM: each tenant runs on its own guest kernel above hypervisor + hardware. Each guest kernel absorbs its own blast.]

The boring attack is resource exhaustion

Security discussions over-focus on dramatic container escapes. Those matter, but most operators get hurt by less cinematic failures.

AI-generated workloads are excellent at creating resource uncertainty.

A generated script can recursively walk a tree, spawn too many processes, create millions of tiny files, download a model checkpoint, compile a native dependency from source, run a browser in headless mode, start a server that never exits, or write logs until the node falls over. A malicious user does not need a kernel zero-day if they can saturate CPU, burn egress, fill ephemeral storage, exhaust file descriptors, or push the host into memory pressure until the wrong process gets killed.

Cgroups help, but cgroups are not a complete product. You still need admission control, queueing, timeouts, per-tenant quotas, disk and inode limits, network policy, egress accounting, process-count limits, cleanup semantics, and a story for workloads that remain "within limit" while destroying tail latency.
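To make one slice of that list concrete, here is a minimal sketch of a per-job budget: POSIX rlimits applied in the child plus a wall-clock timeout enforced by the parent. It assumes a Unix host, and the names Budget and run_with_budget are hypothetical. It is deliberately not an isolation boundary, and it covers none of the admission control, quota, inode, or network concerns above.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical per-job budget; the field names are illustrative."""
    cpu_seconds: int = 2                     # CPU time, enforced by the kernel
    wall_seconds: float = 5.0                # wall-clock time, enforced by the parent
    max_file_bytes: int = 16 * 1024 * 1024   # largest single file the job may write
    max_open_files: int = 64


def _clamp(kind: int, wanted: int) -> None:
    """Lower one rlimit, never asking for more than the current hard limit."""
    _, hard = resource.getrlimit(kind)
    value = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    resource.setrlimit(kind, (value, value))


def run_with_budget(argv: list[str], budget: Budget = Budget()) -> dict:
    def apply_limits() -> None:
        # Runs in the child process between fork() and exec().
        _clamp(resource.RLIMIT_CPU, budget.cpu_seconds)
        _clamp(resource.RLIMIT_FSIZE, budget.max_file_bytes)
        _clamp(resource.RLIMIT_NOFILE, budget.max_open_files)

    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=budget.wall_seconds,  # catches jobs that sleep instead of spin
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


if __name__ == "__main__":
    print(run_with_budget([sys.executable, "-c", "print('hi')"]))
```

The two clocks are separate on purpose: a CPU limit does nothing against a job that blocks forever, and a wall-clock timeout alone lets a job burn every core until the deadline. Everything else in the list still has to be built around this.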

This is where "just run it in Docker" turns into an infrastructure team.

The hard part is not starting one container. The hard part is safely running thousands of short-lived, partially adversarial jobs from different users. That problem already has a name: multi-tenant compute.

This is serverless again

Serverless providers ran into this wall years ago.

AWS Lambda and Fargate had to execute customer code at high density without pretending it was friendly. Firecracker exists because AWS wanted VM-like isolation with container-like density and startup behavior. The Firecracker project describes its mission as secure, multi-tenant, minimal-overhead execution for container and function workloads. AWS's original Firecracker writeup says microVMs combine traditional VM isolation with container-style resource efficiency. AWS also stated that Lambda uses Firecracker to provision sandboxes for customer code, and that Fargate moved tasks onto Firecracker microVMs to improve density without giving up VM-level isolation, where each task keeps its own guest kernel.

AI code execution wants the same properties: fast cold starts, high density, strong tenant isolation, cheap cleanup, policy enforcement, and enough observability to debug the inevitable "the agent failed" screenshot.

Cloudflare's Workers model is another version of the same story. Their docs frame sandboxing as secure isolation plus API design. Workers use V8 isolates for dense execution, then add process-level sandboxing, cordoning, and API constraints. Their Workers for Platforms docs now explicitly mention running untrusted code written by customers or by AI in isolated sandboxes.

That is the signal. AI did not create a new infrastructure category. It pushed more products into the category serverless and edge platforms were already living in.

Why microVMs are attractive

MicroVMs are attractive because they move the isolation boundary below the guest kernel without bringing back the full baggage of traditional VMs.

A normal VM gives you a separate guest kernel, hardware virtualization, and a clean blast-radius boundary, but historically came with slow boot, high memory overhead, large images, and annoying orchestration. Firecracker, Cloud Hypervisor, and Kata-style approaches attack that cost model. Firecracker's public docs describe a minimal device model, low memory overhead, and startup in the low hundreds of milliseconds, specifically for serverless-style workloads.

For AI code execution, this is a clean trade.

You can run normal Linux inside the microVM. Python works. Native dependencies work. pip install works, for better or worse. You can even run a Docker daemon inside the VM so the agent gets Docker UX without touching the host daemon. Docker's own Docker Sandboxes documentation is a useful industry signal: their AI coding agent sandbox runs each agent inside a microVM with its own Docker daemon, filesystem, and network.

Mounting /var/run/docker.sock into an "isolated" container is not sandboxing. It is giving the workload a control plane for the host.
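A minimal sketch of the kind of pre-flight check this implies: refuse job specs that hand the workload host-level authority. The spec shape (mounts, privileged, network_mode, pid_mode, cap_add) is a hypothetical dictionary for illustration, not the Docker API, and the capability list is an assumption about what counts as overbroad.

```python
DOCKER_SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")

# Assumption: an illustrative, incomplete set of capabilities treated as overbroad.
RISKY_CAPABILITIES = {"SYS_ADMIN", "SYS_PTRACE", "SYS_MODULE", "ALL"}


def host_control_findings(spec: dict) -> list[str]:
    """Return the reasons a (hypothetical) container spec exposes host authority."""
    findings = []
    for mount in spec.get("mounts", []):
        source = mount.get("source", "")
        if source in DOCKER_SOCKET_PATHS:
            findings.append(f"mounts Docker socket {source}")
        elif source == "/":
            findings.append("mounts host root filesystem")
    if spec.get("privileged", False):
        findings.append("runs privileged")
    if spec.get("network_mode") == "host":
        findings.append("shares host network namespace")
    if spec.get("pid_mode") == "host":
        findings.append("shares host PID namespace")
    for cap in spec.get("cap_add", []):
        if cap in RISKY_CAPABILITIES:
            findings.append(f"adds capability {cap}")
    return findings


if __name__ == "__main__":
    job = {"mounts": [{"source": "/var/run/docker.sock", "target": "/var/run/docker.sock"}]}
    print(host_control_findings(job))
```

An empty list does not mean the job is safe. It only means the job is not obviously holding the keys to the host.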

MicroVMs are not free. You now own guest kernels, rootfs images, snapshotting, VM networking, block devices, patching, boot optimization, metrics across a virtualization boundary, and a scheduler that understands a heavier unit than a process. Warm pools and snapshots help, but they are machinery, not magic.

Still, for truly untrusted code, that cost can be rational. The question is not "is a microVM slower than a container?" The question is "what is the cost of one tenant escaping a shared-kernel node?"

Why gVisor keeps showing up in the middle

gVisor is the compromise you reach when you want container ergonomics but do not want arbitrary applications talking directly to the host kernel syscall surface.

The project describes itself as an application kernel. Its runsc runtime integrates with Docker and Kubernetes, but system interfaces normally implemented by the host kernel move into a per-sandbox userspace kernel called Sentry. Application syscalls are intercepted and handled by gVisor rather than passed straight through to the host kernel. The result is not a normal VM and not just a seccomp filter. It keeps much of the container workflow, reduces host kernel exposure, and pays some compatibility and syscall-overhead cost.

That middle ground is useful.

Some workloads do not justify full microVM overhead. Some platforms are already deeply Kubernetes-shaped. Some teams need a migration path where setting runtimeClassName: gvisor on a pod is less politically impossible than building a VM scheduler.

But gVisor is also not magic. The docs are honest about tradeoffs: syscall-heavy workloads can perform poorly, and compatibility is not identical to Linux. Stronger boundaries usually mean less compatibility, more overhead, or more operational complexity. Sometimes all three. Sorry.

Why Wasm is attractive

Wasm keeps coming back because its default posture is closer to what untrusted execution wants.

A WebAssembly module does not start with a POSIX process model and then ask Linux to take things away. It starts with a small sandboxed execution model. The official WebAssembly security docs describe modules executing in a sandbox separated from the host runtime, with access mediated through APIs. Wasmtime's security docs make the same point more directly: WebAssembly is designed to execute untrusted code safely, with no raw syscall access and all outside-world interaction going through imports and exports.

WASI adds the part systems people care about: capabilities. WASI's design principles say access to external resources is provided by capabilities, with no ambient authorities. That is a very different shape from a Linux process inheriting a user identity, environment, filesystem view, network stack, and whatever else the parent process accidentally left around.

For narrow workloads, this is beautiful.

If the task is "run this user-defined transform", "execute this plugin", "evaluate this policy", "process this request at the edge", or "run a small generated function with bounded inputs and outputs", Wasm is an excellent fit. Startup can be fast, memory can be bounded cleanly, the host API can be tiny, and the runtime can say: "there is no filesystem API here, so you cannot use the filesystem."

That sentence is stronger than "we mounted a filesystem but wrote a policy we hope covers all weird paths."
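The difference is easier to see in code. Below is a rough Python sketch of the shape, similar in spirit to WASI's preopened directories: the "guest" function receives directory handles and nothing else. DirCapability and transform are hypothetical names. In plain Python this is only a convention, since the guest could still import os. A Wasm runtime makes it real because the module has no other way out.

```python
import tempfile
from pathlib import Path


class DirCapability:
    """A handle to one directory. Holding it is the only way to reach its files."""

    def __init__(self, root: Path, writable: bool = False) -> None:
        self._root = root.resolve()
        self._writable = writable

    def _resolve(self, relative: str) -> Path:
        candidate = (self._root / relative).resolve()
        # Rejects absolute paths, "..", and symlinks that land outside the root.
        if not candidate.is_relative_to(self._root):
            raise PermissionError(f"{relative!r} is outside this capability")
        return candidate

    def read_text(self, relative: str) -> str:
        return self._resolve(relative).read_text()

    def write_text(self, relative: str, data: str) -> None:
        if not self._writable:
            raise PermissionError("capability is read-only")
        self._resolve(relative).write_text(data)


def transform(inp: DirCapability, out: DirCapability) -> None:
    """Stand-in for guest code: it can only use the handles it was given."""
    out.write_text("result.txt", inp.read_text("data.txt").upper())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "in").mkdir()
        (base / "out").mkdir()
        (base / "in" / "data.txt").write_text("hello")
        transform(DirCapability(base / "in"), DirCapability(base / "out", writable=True))
        print((base / "out" / "result.txt").read_text())
```

Note what the host never had to write: a policy listing every path the guest must not touch.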

[Figure: Capability grant model. A Wasm module always has linear memory and a CPU time slice. Everything else is deny by default: no filesystem, network, env, or syscalls until the host hands a capability over. Items shown, none granted in the initial state (0 / 3 granted): read /workspace/input, write /workspace/output, https -> api.example.com, process env vars, raw TCP, parent dir traversal.]

Compare with a Linux process: it inherits a filesystem view, env, $PATH, network reachability, and credentials the moment it starts. Authority subtraction is harder than authority granting.

Do not oversell Wasm

This is where people get carried away.

Wasm is not replacing Docker for general AI code execution. Not soon, and maybe not ever in that broad form.

The reason is annoyingly practical: Python won a lot of the workloads that AI systems actually execute. The 2025 Stack Overflow Developer Survey had Python at 57.9% usage among respondents, and JetBrains' State of Python 2025 reported heavy use across data processing and machine learning. Those ecosystems are full of native extensions, wheels, subprocess assumptions, dynamic linking, filesystem expectations, and libraries written for "normal Linux", not for a carefully shaped capability runtime.

Ugly workloads look like this:

pip install pandas numpy scipy shapely playwright
playwright install chromium
python generated_analysis.py

Or worse, they need apt, gcc, node-gyp, CUDA libraries, ffmpeg, libxml2, headless browsers, package post-install scripts, and a filesystem that behaves like the one every open-source maintainer assumed existed.

Wasm can support more than people give it credit for, and WASI/component-model work is moving in the right direction. But native dependency pain is real. Tooling maturity varies. Debugging can be worse. Networking and filesystem APIs are still not equivalent to Linux. Language support is uneven. If your product promise is "paste any GitHub repo and let an agent fix it", Wasm is probably not your primary execution substrate.

That is fine.

Wasm does not need to replace containers to matter. It needs to take the parts of the workload that benefit from a small, explicit, capability-based sandbox. MicroVMs can take the messy Linux-shaped parts. Containers can remain the packaging format inside those boundaries.

The winning architecture is not "Wasm vs Docker". It is "which trust boundary belongs around this workload?"

Capability-based security is the real architectural shift

The important idea underneath Wasm, Workers, Docker Sandboxes, and most serious agent sandboxes is not the specific runtime. It is the move away from ambient authority.

Ambient authority is what normal processes inherit by default: a filesystem view, environment variables, network reachability, binaries on $PATH, local services, and whatever credentials the parent process accidentally left around. Security then tries to subtract pieces from that world.

Capability-based design flips this. The workload starts with almost nothing. It receives specific handles, APIs, proxies, domains, files, directories, credentials, or tools. Those capabilities are the product surface.

This matters because the model is not the only attacker. A repo can prompt-inject through documentation. A package install script can attack the runtime. A web page can attack a browser agent. A generated test can attack the next generated command. If credentials are ambient, the agent will eventually print them, upload them, or bake them into a file that gets committed because of course it will.

A serious execution environment therefore treats every external interaction as a capability: scoped workspace access, deny-by-default egress, proxy-side credential injection, controlled package mirrors, explicit CPU/memory/disk/process budgets, private Docker daemons, and narrow artifact export paths.
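Two items from that list, deny-by-default egress and proxy-side credential injection, fit in a small sketch. EgressPolicy and authorize are hypothetical names, and the bearer-token scheme is an assumption for illustration. A real proxy also has to deal with DNS, redirects, and TLS, none of which appear here.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass(frozen=True)
class EgressPolicy:
    """Hypothetical per-job policy: any host not listed is denied."""
    allowed_hosts: frozenset = frozenset()
    # Secrets live with the proxy, keyed by host. The sandbox never sees them.
    credentials: dict = field(default_factory=dict)


def authorize(policy: EgressPolicy, url: str, headers: dict) -> dict:
    """Return the headers the proxy would forward, or raise PermissionError."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or host not in policy.allowed_hosts:
        raise PermissionError(f"egress denied: {url}")
    # Drop any Authorization header the workload supplied, then inject ours.
    forwarded = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    token = policy.credentials.get(host)
    if token is not None:
        forwarded["Authorization"] = f"Bearer {token}"
    return forwarded


if __name__ == "__main__":
    policy = EgressPolicy(frozenset({"api.example.com"}), {"api.example.com": "s3cret"})
    print(authorize(policy, "https://api.example.com/v1/items", {"Accept": "application/json"}))
```

The point of the shape is that the generated code can make the request without ever holding the secret, so there is nothing for it to print, upload, or commit.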

This is not generic "security best practice". This is table stakes once generated code gets a shell.

[Figure: Agent sandbox control plane. user request (prompt + repo context) -> orchestrator (tenant id, policy, budget) -> scheduler (route by workload class) -> sandbox boundary (Wasm, gVisor, or microVM) holding the agent process (generated code) and a private fs + dockerd (snapshot per job). Connected services: network proxy (deny-by-default egress), credential broker (proxy-side injection), package mirror (pinned + cached), artifact store (controlled export). Legend: untrusted code, policy, controlled capability, data path.]

AI infrastructure is converging with cloud infrastructure

A lot of AI infrastructure discussion still over-indexes on model serving: GPUs, batching, KV cache, quantization, speculative decoding, routing, evals, and cost per token. Those are real problems. But agentic systems add another control plane next to inference: the code execution plane.

That execution plane looks suspiciously like cloud infrastructure.

It needs schedulers, isolation domains, warm pools, image distribution, snapshotting, resource metering, tenant identity, network policy, secrets management, artifact storage, audit logs, garbage collection, abuse controls, and quota enforcement. It has to handle noisy neighbors, cold starts, stuck jobs, filesystem cleanup, cache poisoning, dependency supply chain risk, and workloads that only fail under concurrency.

In other words, the AI platform becomes a small serverless provider whether it wants to or not.

This is why Firecracker, gVisor, Wasm, V8 isolates, hardened containers, and capability sandboxes are resurfacing. The workload shape changed in a way that makes their original tradeoffs valuable again.

Containers will still be everywhere because they are too useful as a packaging abstraction. But for untrusted AI execution, containers are increasingly not the outer boundary. They are a thing you run inside the boundary.

The cynical but useful rule is:

Use containers when you trust the application and need packaging. Use Wasm or microVMs when you distrust the workload and need containment.

The practical architecture

I would not start with "which sandbox is coolest?" I would classify workloads by Linux compatibility and authority.

Small pure functions, plugins, policy evaluation, request transforms, and generated code with narrow inputs should bias toward Wasm or an isolate-style runtime. Do not give code a filesystem just because POSIX nostalgia says every program deserves one.

Messy repo-level tasks should bias toward microVMs. If the agent needs to clone a repo, install dependencies, run tests, start services, execute browsers, or run Docker Compose, give it Linux inside a hard boundary. Keep Docker inside the VM. Snapshot the workspace. Proxy the network. Kill the VM when done.
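"Kill the VM when done" only works if teardown is unconditional. A minimal sketch of that lifecycle, using an in-memory stand-in: FakeMicroVM and every method on it are hypothetical, not any real VM driver's API.

```python
from contextlib import contextmanager


class FakeMicroVM:
    """In-memory stand-in for a VM driver; every method here is hypothetical."""

    def __init__(self, log: list) -> None:
        self.log = log

    def boot(self, snapshot: str) -> None:
        self.log.append(f"boot:{snapshot}")

    def run(self, command: str) -> int:
        self.log.append(f"run:{command}")
        if command == "crash":
            raise RuntimeError("job failed")
        return 0

    def export(self, path: str) -> None:
        self.log.append(f"export:{path}")

    def destroy(self) -> None:
        self.log.append("destroy")


@contextmanager
def sandbox(vm, snapshot: str):
    try:
        # Boot inside the try block so a half-booted VM is still torn down.
        vm.boot(snapshot)
        yield vm
    finally:
        # Teardown runs on success, on exceptions, and on cancellation upstream.
        vm.destroy()


if __name__ == "__main__":
    events = []
    with sandbox(FakeMicroVM(events), "workspace-base") as vm:
        vm.run("pytest")
        vm.export("/workspace/output")
    print(events)
```

Artifacts leave through one explicit export call. Everything else dies with the VM, which is the cleanup story containers on a shared host make you write by hand.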

Kubernetes-native platforms can look at gVisor or Kata-style runtimes. Hardened containers still have a place for trusted jobs, internal automation, one-tenant environments, and low-risk code paths. Once untrusted tenants share nodes, "we set --cap-drop=ALL" is not a complete story.

The trap is building one universal runtime. That usually produces either an unsafe general-purpose container system or an overbuilt VM platform for code that could have been a small Wasm module.

A realistic stack looks like this:

Wasm / isolates        -> narrow APIs, small generated functions, plugins
hardened containers    -> trusted jobs, internal automation, packaging convenience
gVisor / Kata          -> OCI compatibility with stronger isolation
microVMs               -> arbitrary Linux-shaped untrusted workloads
full VMs / dedicated   -> high-risk tenants, compliance, long-lived state, GPUs

The details vary, but the direction is hard to ignore.
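As a rough sketch, the mapping above can be written as a routing function. The Workload fields and the order of the checks are my assumptions for illustration. A real scheduler would weigh far more than four booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Boundary(Enum):
    WASM = "wasm / isolate"
    HARDENED_CONTAINER = "hardened container"
    GVISOR_KATA = "gVisor / Kata"
    MICROVM = "microVM"
    DEDICATED = "full VM / dedicated host"


@dataclass(frozen=True)
class Workload:
    """Hypothetical classification inputs for one job."""
    trusted: bool                 # reviewed code from a source you control
    needs_linux: bool             # shell, pip/apt, browsers, Docker, native deps
    needs_oci_only: bool = False  # runs as an unmodified OCI image on Kubernetes
    high_risk: bool = False       # compliance, long-lived state, GPUs


def choose_boundary(w: Workload) -> Boundary:
    # Assumed precedence: risk first, then trust, then Linux compatibility.
    if w.high_risk:
        return Boundary.DEDICATED
    if w.trusted:
        return Boundary.HARDENED_CONTAINER
    if not w.needs_linux:
        return Boundary.WASM
    if w.needs_oci_only:
        return Boundary.GVISOR_KATA
    return Boundary.MICROVM


if __name__ == "__main__":
    print(choose_boundary(Workload(trusted=False, needs_linux=True)).value)
```

The useful property is that the default branch, the one reached when nothing else matches, is the strongest general-purpose boundary rather than the most convenient one.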

The boring conclusion

Docker is not bad. Docker is just being asked to do a job that belongs to a different layer.

Containers are excellent at packaging trusted applications. They are not, by themselves, a sufficient answer for arbitrary AI-generated code in a hostile multi-tenant environment. The shared kernel matters. Resource exhaustion matters. Ambient credentials matter. Network egress matters. Cleanup matters. Tenant scheduling matters. The Docker socket definitely matters.

The resurgence of Wasm, microVMs, Firecracker, gVisor, and hardened execution environments is not random infrastructure fashion. It is the industry rediscovering a boundary problem.

When the workload is trusted, optimize for deployment speed and density.

When the workload is untrusted, optimize for containment first, then claw back speed and density.

LLMs did not invent untrusted execution problems.

They simply made them mainstream again.