Linux System Programming · advanced · ~20 min

Namespaces + cgroups — the container ingredients

Understand the Linux features Docker is built from.

Overview

Namespaces isolate a process's view of a kernel resource: PIDs, network, mounts, UIDs, IPC, hostname. cgroups limit how much of a resource a group of processes can use: CPU, memory, IO, PIDs. Together they make containers.

Why it matters

When you debug a container or audit a runtime, you need to know which feature owns which piece. 'CPU throttled' = cgroups. 'Can't see other processes' = PID namespace.

Core concepts

The 8 namespaces. PID, network, mount, UTS (hostname), IPC, user, cgroup, and (since Linux 5.6) time. A process creates new ones with unshare(2) or clone(2) and joins existing ones with setns(2).

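cgroups v2. A single hierarchy, usually mounted at /sys/fs/cgroup. Each cgroup is a directory; you apply limits by writing to its control files such as cpu.max, memory.max, and pids.max, and you move a process into the cgroup by writing its PID to cgroup.procs.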
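As a rough illustration of "limits are just file writes", here is a minimal sketch in C. The helper names cgroup_write and apply_limits and the limit values are hypothetical; the file names cpu.max, memory.max, and pids.max are the ones mentioned above. It assumes the cgroup directory already exists and that the cpu, memory, and pids controllers are enabled for it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write one value to one control file inside a cgroup directory.
 * Returns 0 on success, -1 on any failure. */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    /* Control files already exist in a real cgroup, so no O_CREAT. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/* Apply three example limits (values chosen only for illustration). */
int apply_limits(const char *cgroup_dir)
{
    /* cpu.max is "<quota> <period>" in microseconds:
     * 50 ms of CPU time per 100 ms period, i.e. half a CPU. */
    if (cgroup_write(cgroup_dir, "cpu.max", "50000 100000") < 0)
        return -1;
    /* memory.max is a byte count: 256 MiB. */
    if (cgroup_write(cgroup_dir, "memory.max", "268435456") < 0)
        return -1;
    /* pids.max caps the number of tasks in the cgroup. */
    if (cgroup_write(cgroup_dir, "pids.max", "64") < 0)
        return -1;
    return 0;
}
```

A caller would pass the path of a cgroup it is allowed to modify, for example apply_limits("/sys/fs/cgroup/demo") where "demo" is a hypothetical cgroup created beforehand with mkdir. Writing usually needs root or a delegated subtree.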

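User namespace. Lets a non-root user be root inside the namespace: UID 0 inside maps to an unprivileged UID outside, and the capabilities it grants apply only to resources the namespace owns. Foundation of rootless containers.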
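The mapping step can be sketched as follows. This is a minimal illustration, not how any particular runtime does it: the function names are hypothetical, and it assumes the kernel allows unprivileged user namespaces (some distributions restrict them) and that the calling process is single-threaded.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on failure. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t written = write(fd, text, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/* Move the caller into a new user namespace and map its current
 * UID/GID to 0 inside it. Returns 0 on success, -1 on failure. */
int enter_user_namespace_as_root(void)
{
    /* Capture the outer IDs first; they read as "nobody" once unshared. */
    uid_t outer_uid = geteuid();
    gid_t outer_gid = getegid();
    char map[64];

    if (unshare(CLONE_NEWUSER) < 0)
        return -1;

    /* An unprivileged process must disable setgroups(2)
     * before it is allowed to write gid_map. */
    if (write_file("/proc/self/setgroups", "deny") < 0)
        return -1;

    /* Map format: "<id inside> <id outside> <count>". */
    snprintf(map, sizeof map, "0 %u 1", (unsigned)outer_uid);
    if (write_file("/proc/self/uid_map", map) < 0)
        return -1;

    snprintf(map, sizeof map, "0 %u 1", (unsigned)outer_gid);
    if (write_file("/proc/self/gid_map", map) < 0)
        return -1;

    return 0;
}
```

After a successful call, geteuid() returns 0 inside the namespace, while files the process creates on the host are still owned by the original unprivileged UID.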

Pentester mindset. Container escape research focuses on: (a) shared kernel attack surface (every container on a host talks to the same kernel, so one kernel bug can cross every namespace boundary); (b) namespace-leakage bugs; (c) misconfigured cgroups. Many container vulnerabilities fall into one of these three categories.

Defensive coding habit. When you're writing a sandbox, document which namespaces you create and which cgroup knobs you set. Assume anything not on that list is shared with the host.

Syntax notes

#define _GNU_SOURCE   /* must come before any #include; exposes unshare, setns, CLONE_NEW* */
#include <sched.h>
int unshare(int flags);
int setns(int fd, int nstype);
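To make the setns signature concrete, here is a small sketch that joins one namespace of another process through its /proc entry. The function name join_namespace is hypothetical; the /proc/<pid>/ns/<name> path layout and the setns call are standard Linux.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Join the namespace called ns_name ("net", "uts", "mnt", ...) that
 * process pid currently lives in. nstype is the matching CLONE_NEW*
 * flag, or 0 to accept any type. Returns 0 on success, -1 on failure. */
int join_namespace(pid_t pid, const char *ns_name, int nstype)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, ns_name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    /* The fd refers to the namespace itself, not to the process. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = setns(fd, nstype);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

For example, join_namespace(1234, "net", CLONE_NEWNET) would put the caller into the network namespace of process 1234 (a made-up PID). Passing the expected type rather than 0 makes the call fail if the file descriptor turns out to refer to a different kind of namespace. The call generally needs CAP_SYS_ADMIN.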

Lesson

Containers = namespaces (isolation) + cgroups (resource caps). Both are kernel features Docker stitches into one user-facing thing. Reading what they actually do helps you understand what a container can and can't protect against.

Code examples

if (unshare(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS) < 0)
    perror("unshare");
/* the caller gets its own hostname and mount table immediately;
   only a child forked after this call sees itself as PID 1 */
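One detail that is easy to miss in the snippet above: unshare(CLONE_NEWPID) does not move the calling process. Only the first child forked afterwards becomes PID 1 of the new namespace. The sketch below demonstrates that; the function name is hypothetical, and the call typically needs CAP_SYS_ADMIN (root, or root inside a user namespace).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a new PID namespace, fork once, and return the PID the child
 * sees for itself. Returns -1 if any step fails. */
int pid_seen_by_first_child(void)
{
    if (unshare(CLONE_NEWPID) < 0)
        return -1;                       /* commonly EPERM when unprivileged */

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Inside the new namespace: report our own PID via the exit status. */
        _exit((int)getpid() & 0xff);
    }

    /* The parent still sees the child under its PID in the old namespace. */
    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

When the call succeeds, the returned value is 1, while the parent's own getpid() is unchanged. Once that first child exits, the namespace has lost its init process and further fork() calls from the same parent fail.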

Line by line

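if (unshare(CLONE_NEWPID | CLONE_NEWNS) < 0) {
    perror("unshare"); return -1;
}
/* The caller stays in its old PID namespace: fork() here and run the
   rest in the child, which is PID 1 in the new one. */
/* keep the new mount from propagating back to the host */
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
/* mount a fresh /proc inside the new mount + PID namespace */
mount("proc", "/proc", "proc", 0, NULL);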

Common mistakes

  • Confusing namespaces (isolation) with cgroups (resource limits).

Debugging tips

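ls -l /proc/self/ns/ lists your current namespaces as symlinks of the form net:[inode]. Two processes are in the same network namespace exactly when their net links show the same inode number; compare /proc/<pid>/ns/net for each PID.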
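The same comparison can be done programmatically. This is a minimal sketch with hypothetical function names; it reads the symlink text with readlink(2) and compares the strings, which is what you would do by eye with ls -l.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the link text of /proc/<pid>/ns/<ns>, a string of the form
 * "<ns>:[<inode>]". Returns 0 on success, -1 on failure. */
static int ns_id(pid_t pid, const char *ns, char *out, size_t out_len)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, ns);

    /* readlink does not NUL-terminate, so leave room and do it here. */
    ssize_t n = readlink(path, out, out_len - 1);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return 0;
}

/* Returns 1 if both processes share the given namespace, 0 if they
 * do not, and -1 if either link could not be read. */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
    char id_a[64], id_b[64];
    if (ns_id(a, ns, id_a, sizeof id_a) < 0 ||
        ns_id(b, ns, id_b, sizeof id_b) < 0)
        return -1;
    return strcmp(id_a, id_b) == 0;
}
```

Reading another user's /proc/<pid>/ns entries may be denied, in which case the function returns -1 rather than guessing.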

Memory safety

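Namespaces and cgroups do nothing for memory safety inside a process. They matter at a different layer: container escapes typically go through the shared kernel, not through your program's heap.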

Real-world uses

Docker, containerd, runc, podman, LXC, systemd-nspawn. Each of these runtimes is, at bottom, a front end to these kernel primitives.

Practice tasks

  1. unshare(CLONE_NEWUTS) and sethostname; parent unaffected.
  2. Read /sys/fs/cgroup/.../cpu.max.
  3. Read /proc/self/ns/pid; compare to a child's.

Summary

Namespaces isolate. cgroups limit. Together they make containers. Audit which you set.
