Linux System Programming · advanced · ~20 min

Namespaces + cgroups — the container ingredients

Understand the Linux features Docker is built from.

Overview

What containers are made of

Containers are built from two Linux kernel features:

Namespaces isolate what a process can see. Each namespace gives a process its own private view of one kernel resource: process IDs (PIDs), the network, mount points, user IDs (UIDs), inter-process communication (IPC), and the hostname.
cgroups (control groups) limit how much a process can use. They cap resources such as CPU, memory, disk I/O, and the number of PIDs a group of processes may create.

Put simply: namespaces isolate, cgroups limit. Combine the two and you get a container.

Why it matters

When you debug a container or audit a runtime, you need to know which feature controls which behavior.

This mapping makes diagnosis fast:

"CPU is being throttled" points to a cgroup limit.
"The process can't see other processes" points to a PID namespace.

Knowing which primitive owns each symptom tells you where to look first.
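
To confirm the "CPU is being throttled" diagnosis, one option is to read the nr_throttled counter from the cgroup's cpu.stat file. The sketch below assumes cgroups v2 with the cpu controller enabled; the helper name cgroup_stat_field is made up for this lesson, and the path you pass depends on where your runtime places the container's cgroup.

```c
#include <stdio.h>
#include <string.h>

/* Return the value of `key` in a cgroup v2 flat-keyed file such as cpu.stat,
 * or -1 if the file cannot be read or the key is absent.
 * (Hypothetical helper for illustration, not part of any library.) */
long long cgroup_stat_field(const char *path, const char *key)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char name[64];
    long long value;
    long long result = -1;

    /* Each line has the form "<name> <value>". */
    while (fscanf(f, "%63s %lld", name, &value) == 2) {
        if (strcmp(name, key) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}
```

If nr_throttled keeps growing between two reads, the group is hitting its cpu.max quota, which points at the cgroup rather than at a namespace.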

Core concepts

The 7 namespaces

Containers typically rely on seven namespace types: PID, network, mount, UTS (hostname and domain name), IPC, user, and cgroup. Linux 5.6 added an eighth, the time namespace.

You create them with the system calls unshare(2) or clone(2), and join an existing one with setns(2).
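
Each namespace type has a CLONE_NEW* flag and a matching entry name under /proc/<pid>/ns/. The table below is a small sketch of that mapping. The struct and the ns_flag helper are made up for this lesson, and the two fallback defines are only there in case your C library headers predate the newest flags.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Older C library headers may lack the two newest flags; these values
 * match the kernel's definitions. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

struct ns_type {
    const char *proc_name; /* entry name under /proc/<pid>/ns/ */
    int flag;              /* flag for unshare(2), clone(2), setns(2) */
};

static const struct ns_type ns_types[] = {
    { "pid",    CLONE_NEWPID },
    { "net",    CLONE_NEWNET },
    { "mnt",    CLONE_NEWNS },
    { "uts",    CLONE_NEWUTS },
    { "ipc",    CLONE_NEWIPC },
    { "user",   CLONE_NEWUSER },
    { "cgroup", CLONE_NEWCGROUP },
    { "time",   CLONE_NEWTIME },
};

/* Look up the flag for a namespace name; returns 0 if the name is unknown. */
int ns_flag(const char *proc_name)
{
    for (size_t i = 0; i < sizeof ns_types / sizeof ns_types[0]; i++)
        if (strcmp(ns_types[i].proc_name, proc_name) == 0)
            return ns_types[i].flag;
    return 0;
}
```

Flags can be OR-ed together, so ns_flag("pid") | ns_flag("mnt") is the same value as CLONE_NEWPID | CLONE_NEWNS.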

cgroups v2

cgroups version 2 uses a single mounted hierarchy under /sys/fs/cgroup.

You apply limits by writing to control files, for example:

cpu.max - CPU time budget
memory.max - memory ceiling
pids.max - maximum number of processes
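
Writing a limit is an ordinary file write. The sketch below shows one way to do it; the helper name cgroup_set and the cgroup directory /sys/fs/cgroup/demo in the comment are hypothetical, and the write succeeds only if the cgroup exists, the relevant controller is enabled for it, and you have permission to write to it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write `value` to the control file `knob` inside the cgroup directory
 * `cgroup_dir`. Returns 0 on success, -1 on failure.
 * (Hypothetical helper for illustration.) */
int cgroup_set(const char *cgroup_dir, const char *knob, const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, knob);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1; /* path would have been truncated */

    /* No O_CREAT: control files already exist, so a misspelled knob fails. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/* Example calls, assuming a cgroup named "demo" that you may write to:
 *   cgroup_set("/sys/fs/cgroup/demo", "cpu.max", "50000 100000");  50 ms of CPU per 100 ms period
 *   cgroup_set("/sys/fs/cgroup/demo", "memory.max", "268435456");  256 MiB
 *   cgroup_set("/sys/fs/cgroup/demo", "pids.max", "64");           at most 64 processes
 */
```

Writing the string max to any of these files removes the limit again.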

User namespace

A user namespace lets a non-root user act as root inside the namespace while staying unprivileged outside it. This is the foundation of rootless containers.
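
The sketch below shows roughly how "root inside, unprivileged outside" works: a child enters a new user namespace and maps the caller's own UID and GID to 0 there. The function names are made up for this lesson. Some distributions disable or restrict unprivileged user namespaces, in which case the function reports failure instead of succeeding.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to an existing file; returns 0 on success. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Fork a child, move it into a new user namespace, and map the caller's
 * UID/GID to 0 inside it. Returns 0 if the child then saw itself as root,
 * -1 if any step failed. The calling process itself is never changed.
 * (Hypothetical helper for illustration.) */
int become_root_in_userns(void)
{
    /* Read the IDs before unshare(): afterwards they appear unmapped. */
    unsigned long outer_uid = (unsigned long)geteuid();
    unsigned long outer_gid = (unsigned long)getegid();

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        char map[64];
        if (unshare(CLONE_NEWUSER) < 0)
            _exit(1);
        /* Map format: "<ID inside> <ID outside> <count>". */
        snprintf(map, sizeof map, "0 %lu 1", outer_uid);
        if (write_file("/proc/self/uid_map", map) < 0)
            _exit(1);
        /* An unprivileged writer must disable setgroups(2) before gid_map. */
        if (write_file("/proc/self/setgroups", "deny") < 0)
            _exit(1);
        snprintf(map, sizeof map, "0 %lu 1", outer_gid);
        if (write_file("/proc/self/gid_map", map) < 0)
            _exit(1);
        _exit(geteuid() == 0 && getegid() == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The work happens in a forked child so that the caller keeps its original identity, which mirrors how a rootless runtime sets up a container process without changing itself.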

Pentester mindset

Container-escape research focuses on three layers:

Shared kernel attack surface - every container on a host shares the same kernel, so a kernel bug can break out of any container.
Namespace-leakage bugs - cases where isolation is incomplete and one namespace leaks into another.
Misconfigured cgroups - limits that are too loose or set incorrectly.

Most container vulnerabilities live in one of these three places.

Defensive coding habit

When you write a sandbox, document exactly which namespaces and which cgroup knobs you set.

Treat anything you did not explicitly isolate as something that can leak.

Syntax notes

#define _GNU_SOURCE   /* glibc exposes unshare() and setns() only with this */
#include <sched.h>

int unshare(int flags);   /* detach the caller into new namespaces */
int setns(int fd, int nstype); /* join an existing namespace by fd */
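
A minimal sketch of how setns() is typically used: open the namespace entry under /proc/<pid>/ns/ to get a file descriptor, then pass it to setns(). The wrapper name join_namespace is made up for this lesson. Joining most namespace types requires CAP_SYS_ADMIN, so expect the call to fail with EPERM when run unprivileged.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Join the namespace `ns_name` (an entry under /proc/<pid>/ns/, e.g. "net")
 * of process `pid`. `nstype` is the matching CLONE_NEW* flag, or 0 to accept
 * any namespace type. Returns 0 on success, -1 on failure with errno set.
 * (Hypothetical helper for illustration.) */
int join_namespace(pid_t pid, const char *ns_name, int nstype)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, ns_name);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    int rc = setns(fd, nstype);
    int saved_errno = errno;
    close(fd); /* the descriptor is not needed once setns() has returned */
    errno = saved_errno;
    return rc;
}
```

Passing the specific flag, for example CLONE_NEWNET with "net", makes setns() reject a descriptor that refers to a different namespace type.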

Lesson

A container is simply namespaces (isolation) plus cgroups (resource caps).

Both are kernel features. Docker stitches them together and presents them as one user-facing object.

Reading what these primitives actually do helps you understand what a container can - and cannot - protect against.

Code examples

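/* requires CAP_SYS_ADMIN; define _GNU_SOURCE before including <sched.h> */
if (unshare(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS) < 0) perror("unshare");
/* the caller now has its own hostname and mount table; the first child it
   forks becomes PID 1 of the new PID namespace */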

Line by line

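if (unshare(CLONE_NEWPID | CLONE_NEWNS) < 0) {
    perror("unshare"); return -1;
}
/* keep mounts made here from propagating back to the host */
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) { perror("mount /"); return -1; }
/* unshare() leaves the caller in its old PID namespace; only a child forked
   afterwards is inside the new one, so that child must mount the fresh /proc */
if (fork() == 0 && mount("proc", "/proc", "proc", 0, NULL) < 0)
    perror("mount /proc");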

Common mistakes

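Confusing namespaces (isolation) with cgroups (resource limits).
Expecting unshare(CLONE_NEWPID) to move the calling process into the new PID namespace. Only children created afterwards are placed in it.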

Debugging tips

Run ls -l /proc/self/ns/ to see your current namespaces. Each entry is a symbolic link whose target has the form type:[inode], and the inode number identifies the namespace.

Processes that share a namespace share its inode. For example, two PIDs in the same network namespace show the same net inode.
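
The same comparison can be done from a program. The sketch below calls stat() on both namespace entries and compares device and inode numbers; the helper name same_namespace is made up for this lesson. Reading another process's entries may fail with a permission error if you are not allowed to inspect that process.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if processes `a` and `b` are in the same namespace of type
 * `ns_name` (an entry under /proc/<pid>/ns/), 0 if not, -1 on error.
 * (Hypothetical helper for illustration.) */
int same_namespace(pid_t a, pid_t b, const char *ns_name)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    snprintf(path_a, sizeof path_a, "/proc/%ld/ns/%s", (long)a, ns_name);
    snprintf(path_b, sizeof path_b, "/proc/%ld/ns/%s", (long)b, ns_name);

    /* stat() follows the symlink to the namespace object itself. */
    if (stat(path_a, &st_a) < 0 || stat(path_b, &st_b) < 0)
        return -1;

    /* Same device and same inode means the same namespace. */
    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}
```

For example, same_namespace(getpid(), getppid(), "net") reports whether a process shares its parent's network namespace.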

Memory safety

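Namespaces and cgroups add no memory-safety concerns of their own to your program; the usual C rules still apply, such as checking every return value. Container escapes typically exploit the shared kernel or a misconfiguration, not memory bugs in the sandboxed program.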

Real-world uses

These primitives power every major container runtime:

Docker
containerd
runc
podman
LXC
systemd-nspawn

Each runtime is essentially a user interface built on top of namespaces and cgroups.

Practice tasks

Call unshare(CLONE_NEWUTS), then sethostname(). Confirm the parent's hostname is unaffected.
Read /sys/fs/cgroup/.../cpu.max to inspect a CPU limit.
Use readlink on /proc/self/ns/pid and compare the inode it reports with a child process's PID namespace inode.

Summary

Namespaces isolate what a process can see; cgroups limit what it can use.
Together they form a container - every runtime is a UI over these primitives.
When building a sandbox, document exactly which namespaces and cgroup limits you set.
Map symptoms to the right primitive: throttling = cgroups, hidden processes = PID namespace.