Linux System Programming · advanced · ~20 min
Understand the Linux features Docker is built from.
Namespaces isolate a process's view of a kernel resource: PIDs, network, mounts, UIDs, IPC, hostname. cgroups limit how much of a resource a group of processes can use: CPU, memory, IO, PIDs. Together they make containers.
When you debug a container or audit a runtime, you need to know which feature owns which piece. 'CPU throttled' = cgroups. 'Can't see other processes' = PID namespace.
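To make the 'CPU throttled' case concrete, here is a minimal sketch that reads one counter from a cgroup v2 cpu.stat file. The helper name read_cgroup_stat is hypothetical, and the path of a given container's cgroup directory depends on the runtime, so the caller supplies it.

```c
#include <stdio.h>
#include <string.h>

/* Read one "key value" counter from a cgroup v2 flat-keyed file such as
 * cpu.stat. Returns the value, or -1 if the file or the key is missing.
 * The path comes from the caller; no cgroup layout is assumed here. */
long long read_cgroup_stat(const char *path, const char *key)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char name[64];
    long long value;
    long long result = -1;

    /* each line has the form "<name> <integer>" */
    while (fscanf(f, "%63s %lld", name, &value) == 2) {
        if (strcmp(name, key) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling read_cgroup_stat on a group's cpu.stat with the key "nr_throttled" and getting a nonzero value suggests the group has hit its cpu.max quota. That counter appears only when the cpu controller is enabled for the group.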
The 7 namespaces. PID, network, mount, UTS (hostname), IPC, user, cgroup. Linux 5.6 added an eighth, time. New namespaces are created via unshare(2) or clone(2); setns(2) joins an existing one.
cgroups v2. A single mounted hierarchy under /sys/fs/cgroup. Files like cpu.max, memory.max, pids.max are written to apply limits.
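Applying a limit is an ordinary file write. The sketch below assumes a cgroup directory that already exists and that you may write to; the helper name cgroup_set and the directory /sys/fs/cgroup/demo used in the comments are hypothetical.

```c
#include <stdio.h>

/* Write `value` to the control file `knob` inside `cgroup_dir`.
 * Returns 0 on success, -1 on failure.
 *
 * Possible calls, assuming a hypothetical group /sys/fs/cgroup/demo:
 *   cgroup_set("/sys/fs/cgroup/demo", "memory.max", "268435456");
 *   cgroup_set("/sys/fs/cgroup/demo", "cpu.max", "50000 100000");
 *   cgroup_set("/sys/fs/cgroup/demo", "pids.max", "64");
 */
int cgroup_set(const char *cgroup_dir, const char *knob, const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, knob);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(value, f) >= 0;
    /* the data reaches the kernel when the buffer is flushed, so a
     * rejected value shows up as an fclose() error */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

In cpu.max the two numbers are quota and period in microseconds, so "50000 100000" caps the group at half of one CPU. A knob exists in a group only if its controller is enabled in the parent's cgroup.subtree_control.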
User namespace. Lets a non-root user be root inside the namespace. Foundation of rootless containers.
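A rough sketch of how an unprivileged process can become UID 0 inside a new user namespace: unshare, then map the inner UID and GID 0 to the caller's own IDs. The function names are hypothetical. It assumes a single-threaded caller and a kernel that permits unprivileged user namespaces; some distributions and seccomp profiles disable them, in which case the function simply fails.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write `text` to `path` with a single write(2) call. */
static int write_once(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t len = (ssize_t)strlen(text);
    ssize_t n = write(fd, text, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/* Returns 0 if the caller ends up as UID 0 inside a new user namespace,
 * -1 if any step fails. */
int become_root_in_userns(void)
{
    uid_t outer_uid = geteuid();
    gid_t outer_gid = getegid();
    char map[64];

    if (unshare(CLONE_NEWUSER) < 0)
        return -1;

    /* map format: <id inside> <id outside> <count> */
    snprintf(map, sizeof map, "0 %ld 1\n", (long)outer_uid);
    if (write_once("/proc/self/uid_map", map) < 0)
        return -1;

    /* an unprivileged writer must deny setgroups(2) before gid_map */
    if (write_once("/proc/self/setgroups", "deny") < 0)
        return -1;

    snprintf(map, sizeof map, "0 %ld 1\n", (long)outer_gid);
    if (write_once("/proc/self/gid_map", map) < 0)
        return -1;

    return geteuid() == 0 ? 0 : -1;
}
```

The "root" gained this way holds capabilities only over resources owned by the new user namespace. Outside it, the process still acts as the original UID.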
Pentester mindset. Container escape research focuses on: (a) shared kernel attack surface (every container shares the host kernel); (b) namespace-leakage bugs; (c) misconfigured cgroups. Most container vulnerabilities are at one of these three layers.
Defensive coding habit. When you're writing a sandbox, document which namespaces and which cgroup knobs you set. Assume everything you did not set is shared with the host.
#define _GNU_SOURCE
#include <sched.h>
int unshare(int flags);
int setns(int fd, int nstype);
Containers = namespaces (isolation) + cgroups (resource caps). Both are kernel features Docker stitches into one user-facing thing. Reading what they actually do helps you understand what a container can and can't protect against.
unshare(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS);
/* caller gets its own hostname and mount table at once; the new PID namespace applies only to children forked afterwards, and the first one is PID 1 */
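if (unshare(CLONE_NEWPID | CLONE_NEWNS) < 0) {
    perror("unshare"); return -1;
}
/* stop mount changes from propagating back to the host's mount namespace */
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
/* the caller stays in its old PID namespace; fork so the child becomes PID 1 */
if (fork() == 0) {
    /* in the child: mount a fresh /proc for the new mount + PID namespace */
    if (mount("proc", "/proc", "proc", 0, NULL) < 0) perror("mount");
}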
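One detail of CLONE_NEWPID is easy to miss: unshare() does not move the caller into the new PID namespace, only its later children. The sketch below checks that. The function name is hypothetical, and it assumes the caller has CAP_SYS_ADMIN; without it, unshare() fails and the function returns -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the first child forked after unshare(CLONE_NEWPID) sees
 * itself as PID 1, 0 if it does not, -1 if any step fails. */
int first_child_is_pid1(void)
{
    if (unshare(CLONE_NEWPID) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* report the child's own view of its PID via the exit status */
        _exit(getpid() == 1 ? 0 : 1);
    }

    /* the parent still sees the child under an ordinary PID */
    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Once that first child exits, the new PID namespace has lost its init process, and further fork() calls from the same caller fail with ENOMEM. A real runtime keeps PID 1 alive for the life of the container.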
ls -l /proc/self/ns/ shows your current namespaces as symlinks of the form net:[inode]. Two processes in the same network namespace show the same inode number for net.
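The same inode comparison can be done from code. This is a minimal sketch with a hypothetical helper name, assuming /proc is mounted and the caller may inspect both processes.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both PIDs are in the same namespace of the given type
 * ("net", "pid", "mnt", "uts", "ipc", "user", "cgroup"), 0 if they are
 * not, and -1 if either /proc entry cannot be read. */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof path_a, "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof path_b, "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the link, so st_ino is the namespace's inode */
    if (stat(path_a, &st_a) < 0 || stat(path_b, &st_b) < 0)
        return -1;

    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}
```

For example, same_namespace(getpid(), 1, "net") asks whether the current process shares a network namespace with PID 1. Reading another user's /proc/<pid>/ns entries usually needs elevated privileges, so expect -1 in that case.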
Memory management in your own program is not the concern here; container escapes happen at the kernel level.
Docker, containerd, runc, podman, LXC, systemd-nspawn. Every container runtime is a UI on top of these primitives.
Namespaces isolate. cgroups limit. Together they make containers. Audit which you set.