Safe Penetration Testing Labs · advanced · ~30 min

seccomp-BPF — allow-list the system calls a process may make

By the end of this lesson you will be able to:
- Explain what **seccomp-BPF** is, where the filter lives (in the kernel, on the syscall path), and why it is a *defense-in-depth* control rather than a primary defense.
- Build a tight **allow-list** filter with `libseccomp` (`seccomp_init`, `seccomp_rule_add`, `seccomp_load`) and choose a safe **default action** (`SCMP_ACT_KILL_PROCESS` vs `SCMP_ACT_ERRNO`).
- Discover the *real* syscall set a program needs using `strace -c`, including the "invisible" syscalls (`exit_group`, `rt_sigreturn`, `brk`) that the runtime and libc use behind your back.
- **Verify a sandbox**: prove that a blocked syscall is actually rejected (SIGSYS / `EPERM`) and that allowed syscalls still succeed, in a local, authorized lab only.
- Read an existing seccomp profile the way an authorized auditor does: find syscalls that are permitted but shouldn't be, and reason about escape routes without exploiting anything.
- Log and detect seccomp violations so a real system can alert on blocked-syscall events.

Overview

Security objective. The asset you are protecting is the host kernel's system-call interface — the boundary between an untrusted user-space process and the full power of the operating system. The threat is a process that has been compromised (through a memory-safety bug, a deserialization flaw, a malicious plugin) and now tries to do something it was never meant to do: spawn a shell (execve), open a network socket (socket/connect), load a kernel module, or read arbitrary files. seccomp lets the process voluntarily give up the syscalls it does not need, so that even a fully hijacked process is boxed in. In this lesson you will detect and prevent out-of-policy syscalls, and verify that the box actually holds.

What it is. seccomp (secure computing mode) attaches a small BPF (Berkeley Packet Filter) program to a process's syscall path. BPF here is a tiny, safe, in-kernel rule language — the same idea as packet filters, but the "packets" are syscalls. For every syscall the process makes, the kernel runs the filter, which inspects the syscall number (and optionally argument registers) and returns one action: allow, kill the process/thread, return an errno, trap (raise SIGSYS), notify a supervisor, or log.

Where it fits. This builds directly on your prerequisites. In "linux-syscalls" you learned that every privileged action a program takes (reading a file, allocating memory, starting a process) funnels through a numbered syscall. seccomp is a gate on exactly that funnel. In "rlimit: capping CPU, memory, fds" you saw a process voluntarily restrict its own resources with setrlimit; seccomp applies the same self-restriction philosophy to which operations are permitted rather than to quantities. Both are things a process does to itself, early, before it touches untrusted input, and both are enforced by the kernel, so a later exploit cannot undo them.

Where it is used. seccomp-BPF is the backbone of nearly every modern sandbox: Docker/containerd apply a default profile to containers, Chrome and Firefox confine renderer/content processes, OpenSSH uses it in its privilege-separated child, and systemd exposes it via SystemCallFilter=. Learning to read and write these filters is core to both hardening systems and auditing them.

Why it matters

In authorized professional work, seccomp shows up on both sides of the table.

As a defender / platform engineer, seccomp is one of the cheapest high-value hardening steps available. A single filter, installed at process start, means that a remote-code-execution bug discovered next year still cannot call execve to get a shell or socket to exfiltrate data. You are shrinking the blast radius of vulnerabilities you don't even know about yet. Regulated environments (PCI, SOC 2, container-security benchmarks like the CIS Docker Benchmark) increasingly expect syscall filtering on internet-facing workloads.

As an authorized penetration tester / sandbox auditor, reading the seccomp profile is how you assess whether a sandbox is meaningful or theatrical. A profile that still allows ptrace, mount, bpf, unshare, keyctl, or add_key may be trivially escapable. Your job is to report those gaps with evidence and a remediation — not to weaponize them. "The container claims to be sandboxed" is a marketing statement; "the seccomp profile permits unshare(CLONE_NEWUSER) which enables a known privilege-escalation path" is a finding.

Critically, seccomp is defense-in-depth, never a sole defense. It does not fix the bug that let an attacker run code; it limits what that code can do next. Claiming a service is "secure because it has seccomp" is exactly the kind of overstatement this course teaches you to avoid.

Core concepts

1. seccomp modes: strict vs filter (BPF)

Definition. The original SECCOMP_MODE_STRICT allows only four syscalls (read, write, _exit, sigreturn) — almost unusably tight. The modern SECCOMP_MODE_FILTER lets you attach a custom BPF program that decides per-syscall. This lesson is entirely about filter mode.

Why it works. The filter is evaluated in the kernel, on the syscall entry path, before the syscall executes. A compromised process cannot bypass it because the process itself does not run the check — the kernel does.

When / when not. Use filter mode whenever a process has a small, predictable syscall footprint (parsers, media decoders, network daemons, code-execution sandboxes). It is a poor fit for programs that legitimately need a huge, dynamic syscall surface (a general-purpose shell, a package manager).

Pitfall. A filter is inherited across fork and preserved across execve, and once installed it cannot be removed or loosened; any filter added later can only tighten the policy. (NO_NEW_PRIVS is what allows an unprivileged process to install a filter in the first place.) Install it too early or too tightly and you break the program; the fix is always to add the missing syscall, never to disable the filter.
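The inheritance claim above can be checked from inside a process. The following is a minimal sketch, assuming a Linux system; the helper names `seccomp_mode_self` and `seccomp_mode_of_child` are made up for this illustration. It relies on `prctl(PR_GET_SECCOMP)`, which reports 0 when no seccomp mode is active and 2 in filter mode.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 (seccomp off), 2 (filter mode), or -1 on error.
 * Strict mode (1) is never observed this way, because prctl is not
 * one of the syscalls strict mode permits. */
static int seccomp_mode_self(void)
{
    return prctl(PR_GET_SECCOMP, 0, 0, 0, 0);
}

/* Forks, lets the child query its own mode, and returns that mode
 * to the caller through the child's exit status. */
static int seccomp_mode_of_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int mode = seccomp_mode_self();
        _exit(mode < 0 ? 255 : mode);       /* 255 marks "query failed" */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

Called from an unsandboxed shell, both helpers should report 0; called after a filter is loaded (or inside a container that applies a profile), both should report 2, because the child starts with the parent's filter already attached.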

2. Default-deny (allow-list), not deny-list

Definition. A default action applies to any syscall you did not write an explicit rule for. Setting the default to kill/deny and then explicitly allowing a known-good set is an allow-list (a.k.a. default-deny). The opposite — allow everything, then block a few bad syscalls — is a deny-list.

Plain explanation. Deny-lists are almost always wrong for security: you can only block the bad things you thought of, and the kernel adds new syscalls over time. Allow-lists fail closed — an unforeseen syscall is denied by default.

How it works. seccomp_init(SCMP_ACT_KILL_PROCESS) sets the default to "kill," then each seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(x), 0) punches a hole for one syscall.

When not. Allow-lists take work to build and can break on library upgrades that start using a new syscall. That maintenance cost is the price of failing closed — budget for it.

Pitfall. Building the list from memory. You will always miss the "invisible" syscalls (see concept 4). Build it from a trace.

3. Choosing the default action: kill vs errno

Default action	Effect on a blocked syscall	Best for
`SCMP_ACT_KILL_PROCESS`	Whole process dies with SIGSYS	Sandboxes where any violation is an attack; fail-closed
`SCMP_ACT_KILL`	Offending thread dies (older, riskier)	Rarely preferred; can leave a wounded process
`SCMP_ACT_ERRNO(EPERM)`	Syscall returns `-1`, `errno=EPERM`	Graceful degradation; programs that probe optional syscalls
`SCMP_ACT_LOG`	Allowed, but logged	Discovery phase only — never a production default
`SCMP_ACT_TRAP`	Raises SIGSYS you can catch	Custom handling / telemetry

Pitfall. SCMP_ACT_LOG as the default means everything is allowed. It is a fantastic tool for learning which syscalls a program uses, and a catastrophic thing to ship.
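To make the table concrete, here is a hedged sketch that uses the raw kernel interface (`prctl` plus a hand-written BPF program) instead of libseccomp, so the default action is visible as the filter's final return value. The function name `run_with_default`, the exit codes 10/11/100/101, and the one-entry allow-list are assumptions chosen for the demo. A production filter must also check `seccomp_data.arch` first; that check is omitted here to keep the sketch architecture-neutral.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Fork a child that installs an allow-list containing only exit_group,
 * with `default_action` for everything else, then attempts getppid
 * (not on the list). Returns the child's wait status, or -1. */
static int run_with_default(uint32_t default_action)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct sock_filter prog[] = {
            /* A = syscall number */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* exit_group? skip one instruction to ALLOW, else fall through */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
            /* default action for every syscall not matched above */
            BPF_STMT(BPF_RET | BPF_K, default_action),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog fprog = {
            .len = (unsigned short)(sizeof prog / sizeof prog[0]),
            .filter = prog,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(100);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog, 0, 0) != 0)
            _exit(101);

        errno = 0;
        long r = syscall(SYS_getppid);       /* blocked by the filter */
        /* Reached only if the default action lets the thread continue. */
        _exit(r == -1 && errno == EPERM ? 10 : 11);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

With `SECCOMP_RET_KILL_PROCESS` the child should die from SIGSYS; with `SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)` the blocked call should return -1 with `errno == EPERM` and the child should exit with code 10. Run it only in a lab environment.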

4. The "invisible" syscalls

Even a program that only prints a line makes ~10–20 syscalls it never wrote: brk/mmap/mprotect (memory + loader), futex (locks), rt_sigreturn (returning from a signal), exit_group (normal termination), fstat/ioctl (stdio setup), plus glibc's arch_prctl, set_robust_list, rseq, and getrandom on modern systems. Forgetting these is the single most common cause of a filter that "randomly" kills the process. Always derive the list from strace -c, not from intuition.

5. Reading a filter as an auditor

An allow-list is a policy document. Auditing it means asking: does this process need what it is allowed? A media decoder that is allowed socket, connect, execve, or ptrace has a policy far wider than its function — each is a potential escape or exfiltration route. You report these; you do not exploit them against systems you were not authorized to test.

THREAT MODEL — seccomp-filtered worker process

            UNTRUSTED INPUT (network / file / IPC)
                         |
                         v
  +----------------------------------------------+
  |  Worker process (user space)                 |  <-- may be compromised
  |  - parses attacker-controlled data           |
  |  - installs seccomp filter BEFORE parsing    |
  +----------------------------------------------+
                         |  every syscall
   TRUST BOUNDARY ===========================  <-- seccomp-BPF gate (in kernel)
                         |  allow / kill / errno
                         v
  +----------------------------------------------+
  |  KERNEL — the protected asset                |
  |  syscall table: execve, socket, ptrace,      |
  |  mount, bpf, keyctl, init_module, ...         |
  +----------------------------------------------+

  Entry point:   the syscall instruction in the worker
  Assumption that fails without seccomp:
     "code running in this process only does what the
      source code intends" — false once it is exploited.
  With a default-deny filter, a hijacked worker can reach
  only the handful of syscalls on the allow-list.

Knowledge check.

What asset is protected by the seccomp gate, and where exactly is the trust boundary drawn?
A worker that only decodes images is allowed socket and connect. What insecure assumption does that policy encode, and why is it a finding?
Which log/kernel signal tells you a blocked syscall was attempted, and why must you test all of this only in an authorized lab (localhost / container / disposable VM)?

Syntax notes

The high-level libseccomp API (link with -lseccomp) builds the raw BPF for you. The essential calls:

#include <seccomp.h>   /* libseccomp; also <linux/seccomp.h> for raw prctl */

/* 1. Create a filter context with the DEFAULT action for
 *    any syscall you do not explicitly allow below. */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);

/* 2. Punch one hole in the default: allow syscall `read`.
 *    arg_cnt = 0 means "no argument constraints — any args". */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);

/* 2b. Argument-filtered rule: allow write ONLY to fd 1 (stdout).
 *     SCMP_A0 = argument 0; SCMP_CMP_EQ = equals. */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                 SCMP_A0(SCMP_CMP_EQ, 1));

/* 3. Compile + install the filter into the kernel.
 *    From this point on, the policy is live and irreversible. */
seccomp_load(ctx);

/* 4. Free the builder context (the loaded filter stays active). */
seccomp_release(ctx);

Key points:

Call seccomp_load after all rules are added, and before you touch untrusted input.
SCMP_SYS(name) resolves the syscall name to the right number for the build architecture.
With libseccomp, NO_NEW_PRIVS is set for you by seccomp_load. Raw prctl users must set PR_SET_NO_NEW_PRIVS themselves; otherwise installing the filter fails with EACCES unless the process has CAP_SYS_ADMIN.
Argument filtering (SCMP_A0..A5, SCMP_CMP_*) can only inspect scalar register values, never the contents of a pointer — a TOCTOU trap discussed in Mistakes.

Lesson

seccomp-BPF lets a process attach a BPF (Berkeley Packet Filter) rule to its own syscall path. From that point on, every syscall it makes is checked against the filter.

The standard pattern is simple:

Build a filter with libseccomp.
Allow only the syscalls your program actually uses (often around 20).
Set the default so any other syscall kills the process.

This is the same technique behind Docker, Chrome's sandbox, and OpenSSH's privilege separation (privsep).

Code examples

The shape below is INSECURE → SECURE → VERIFY. All three run only on your own machine / a disposable VM.

(1) WARNING: intentionally vulnerable — use only in a local, isolated, authorized lab. Do not deploy.

This worker parses input and then, if the input says so, spawns a shell. With no seccomp filter, an attacker who controls the input owns the box.

/* vuln_worker.c  —  NO sandbox. Demonstration of the risk only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char line[128];
    if (!fgets(line, sizeof line, stdin)) return 0;
    line[strcspn(line, "\n")] = '\0';

    /* Imagine this branch is reachable via a bug, not by design. */
    if (strcmp(line, "pwn") == 0) {
        char *argv[] = { "/bin/sh", NULL };
        execve("/bin/sh", argv, NULL);   /* attacker gets a shell */
        perror("execve");                /* only reached if execve fails */
        return 1;
    }
    printf("processed: %s\n", line);
    return 0;
}

(2) SECURE — same worker, sandboxed with a default-deny allow-list

execve, socket, and everything else are not on the list, so the pwn branch dies instead of giving a shell.

/* safe_worker.c  —  build: cc -O2 -Wall -Wextra safe_worker.c -lseccomp -o safe_worker */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <seccomp.h>

/* Install a tight allow-list. Returns 0 on success, -1 on failure. */
static int install_sandbox(void) {
    /* Default: kill the whole process on any un-allowed syscall. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx) { fprintf(stderr, "seccomp_init failed\n"); return -1; }

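    /* Syscalls THIS program actually needs. Derive real lists with
     * `strace -c`; this set covers fgets/printf + clean exit on
     * a typical x86-64 glibc system with stdin on a pipe and stdout
     * on a pseudo-terminal. Other setups (for example stdout
     * redirected to /dev/null) may make glibc probe the descriptor
     * with ioctl, which this list would then need as well. */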
    int allow[] = {
        SCMP_SYS(read), SCMP_SYS(write),
        SCMP_SYS(fstat), SCMP_SYS(newfstatat),
        SCMP_SYS(brk), SCMP_SYS(mmap), SCMP_SYS(munmap), SCMP_SYS(mprotect),
        SCMP_SYS(exit_group), SCMP_SYS(rt_sigreturn),
    };
    int rc = 0;
    for (size_t i = 0; i < sizeof allow / sizeof allow[0]; i++)
        rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allow[i], 0);
    if (rc != 0) { fprintf(stderr, "rule_add failed\n"); seccomp_release(ctx); return -1; }

    rc = seccomp_load(ctx);                 /* filter goes live here */
    if (rc != 0) {                          /* libseccomp returns a negative errno value */
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
        seccomp_release(ctx);
        return -1;
    }
    seccomp_release(ctx);                   /* loaded filter stays active */
    return 0;
}

int main(void) {
    if (install_sandbox() != 0) return 2;   /* fail closed: refuse to run unsandboxed */

    char line[128];
    if (!fgets(line, sizeof line, stdin)) return 0;
    line[strcspn(line, "\n")] = '\0';

    if (strcmp(line, "pwn") == 0) {
        char *argv[] = { "/bin/sh", NULL };
        execve("/bin/sh", argv, NULL);      /* NOT on allow-list -> SIGSYS, process killed */
        perror("execve");                   /* never reached */
        return 1;
    }
    printf("processed: %s\n", line);        /* write() IS allowed */
    return 0;
}

(3) VERIFY — prove the fix rejects bad input and accepts good input

# Build
cc -O2 -Wall -Wextra safe_worker.c -lseccomp -o safe_worker

# ACCEPT good input: normal path still works.
printf 'hello\n' | ./safe_worker
#   -> processed: hello        (exit status 0)

# REJECT bad input: the execve escape is blocked by the kernel.
printf 'pwn\n' | ./safe_worker
echo "exit status: $?"
#   -> Bad system call (core dumped)
#   -> exit status: 159         (128 + 31; signal 31 = SIGSYS)

# Confirm the kernel recorded a seccomp kill for the exact syscall.
# (dmesg may need sudo; if auditd is running, the record goes to the audit log instead.)
dmesg | tail -n 3
#   -> audit: ... comm="safe_worker" ... syscall=59 ...   (59 = execve on x86-64)

Expected result. Good input prints processed: hello and exits 0 — the allowed write and read work fine. The pwn input never reaches a shell: the kernel sees an execve that is not on the allow-list, sends SIGSYS, and the process dies (shell reports Bad system call, exit status 159). The dmesg line is your detection evidence that a blocked syscall was attempted. This is the whole point: the escape route is closed by the kernel, not by the (already-hijacked) program.
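The same check can be scripted in C rather than read off the shell. This is a minimal sketch, assuming a POSIX system; the helper names `spawn_and_wait`, `killed_by_sigsys`, and `shell_exit_code` are invented for this example. It shows where the shell's 159 comes from: 128 plus the number of the terminating signal.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (no PATH search) with the caller's stdin/stdout and
 * return its raw wait status, or -1 if fork/wait failed. */
static int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(argv[0], argv);
        _exit(127);                         /* exec itself failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* True when the child was terminated by SIGSYS, the signature of a
 * seccomp kill action. */
static int killed_by_sigsys(int status)
{
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS;
}

/* The value a shell would report in $?: the exit code, or 128 + signal. */
static int shell_exit_code(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

A verification harness could call `spawn_and_wait` on `./safe_worker` once with good input and once with the `pwn` input on stdin, then assert `shell_exit_code(...) == 0` for the first run and `killed_by_sigsys(...)` for the second.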

Line by line

Walkthrough of safe_worker.c and the verify step.

seccomp_init(SCMP_ACT_KILL_PROCESS): creates the filter builder with the default = kill. At this point the in-memory policy is "kill on everything", but nothing is enforced until seccomp_load; the following rules only loosen it for named syscalls.
The allow[] array lists the syscalls the program genuinely needs once the filter is live. read/write service stdio; fstat/newfstatat are how stdio inspects the descriptor when it sets up its buffers; brk/mmap/munmap/mprotect back malloc; rt_sigreturn lets signal handlers return; exit_group performs the normal return/exit. These are the "invisible" syscalls from the concepts section: omit one and the program dies at its first fgets/printf or at exit, not where you expect.
The loop calls seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allow[i], 0) once per syscall. arg_cnt = 0 means "allow with any arguments." rc |= ... accumulates any error so one failed add is caught.
seccomp_load(ctx) — libseccomp compiles the rules to BPF, sets NO_NEW_PRIVS, and installs the filter. From this line on, the sandbox is live and irreversible for the life of the process. Everything after runs boxed.
if (install_sandbox() != 0) return 2; — fail closed. If the sandbox can't be installed, the program refuses to process input rather than running unprotected.
Good input hello: printf ultimately calls write(1, ...), which is on the list → allowed → processed: hello, exit 0.
Malicious input pwn: control reaches execve("/bin/sh", ...). execve (syscall 59 on x86-64) is not on the list, so the kernel matches the default action and sends SIGSYS, killing the process before the shell is created.

Trace of the two runs:

Input	Reaches branch	Syscall attempted	On allow-list?	Kernel action	Observable result
`hello`	print branch	`write(1,...)`	yes	run	`processed: hello`, exit 0
`pwn`	escape branch	`execve(...)` (nr 59)	no	SIGSYS / kill	`Bad system call`, exit 159, `dmesg` audit line

The key value that "changes" is the process's fate: for write the syscall proceeds; for execve the process is terminated at the syscall boundary, so the injected shell command never runs.

Common mistakes

1. Building the allow-list from memory.

Wrong: allow only the syscalls you "call" (read, write) and load the filter.
Why wrong: libc and the runtime make many syscalls you never wrote (brk, mmap, rt_sigreturn, exit_group, futex, getrandom). The program dies at startup or exit with a confusing SIGSYS.
Corrected: run strace -f -c ./prog under representative input, take the syscall column as your starting list, then trim.
Recognise/prevent: SIGSYS very early or exactly at exit is the signature of a missing invisible syscall.

2. Loading the filter too late.

Wrong: parse the untrusted request, then call seccomp_load.
Why wrong: the exploit fires during parsing, before the box is closed.
Corrected: install the sandbox as one of the first things in main, after only the setup the filter itself needs.
Recognise/prevent: code-review rule — seccomp_load must dominate every path that reads attacker data.

3. Using a deny-list instead of an allow-list.

Wrong: default SCMP_ACT_ALLOW, then block execve and socket.
Why wrong: fails open — every syscall you forgot (and every new kernel syscall) is permitted. Blocking execve while leaving execveat open is a classic bypass.
Corrected: default-deny (SCMP_ACT_KILL_PROCESS), allow the known-good set.
Recognise/prevent: if your filter's default is ALLOW, treat it as unfinished.

4. Trusting pointer-argument filters (TOCTOU).

Wrong: seccomp_rule_add(..., SCMP_SYS(open), 1, SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)"/safe")) expecting to allow only path /safe.
Why wrong: seccomp compares the pointer register value (an address), not the string it points to, so this rule does not express a path policy at all. seccomp cannot dereference user memory, and even if it could, the check would be a time-of-check/time-of-use race: another thread can change the buffer after the check and before the kernel reads it.
Corrected: filter only on scalar args (fd numbers, flags, syscall selection). Enforce path policy with the filesystem/namespaces, not seccomp.
Recognise/prevent: if a rule dereferences a pointer to be correct, it is unsound.
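A small sketch of why the pointer rule is unsound, in plain C with no seccomp involved. The helper names `register_values_equal` and `contents_equal` are hypothetical; the first mimics what an argument rule can see (the raw register value), the second mimics what the rule's author wanted.

```c
#include <stdint.h>
#include <string.h>

/* What a seccomp argument rule effectively compares: for a pointer
 * argument the register holds an address, not the bytes behind it. */
static int register_values_equal(const char *a, const char *b)
{
    return (uintptr_t)a == (uintptr_t)b;
}

/* What the rule's author intended to compare: the path text. */
static int contents_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Two separate buffers that both hold "/safe" have equal contents but different register values, so an address-based rule rejects a legitimate call. Conversely, one buffer whose contents are rewritten after the check keeps the same register value while naming a different path, which is the TOCTOU half of the problem.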

5. Treating seccomp as the whole defense / overclaiming.

Wrong: "we added seccomp, the service is secure."
Why wrong: seccomp limits post-exploit capability; it does not stop the bug. Nothing is "completely secure," and passing a scanner or having a filter does not prove safety.
Corrected: combine seccomp with input validation, least privilege, namespaces, and memory-safety work; describe it as one layer.

Debugging tips

Symptom: the program dies with Bad system call (SIGSYS).

Identify the offending syscall: dmesg | tail shows audit: ... syscall=NN .... Map the number with ausyscall NN (from auditd) or look up NN in unistd_64.h (commonly /usr/include/asm/unistd_64.h; on Debian-family systems /usr/include/x86_64-linux-gnu/asm/unistd_64.h).
Reproduce and enumerate what's needed: strace -f -c ./prog < input prints a syscall histogram; compare it against your allow-list to find the gap.
Temporarily switch the default (in a lab copy) to SCMP_ACT_LOG so the program runs and the kernel logs every would-be-denied syscall — read the log, add the legitimate ones, then switch back to KILL_PROCESS. Never ship the LOG version.
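The first step above (pull the number out of the audit line, then map it to a name) can be sketched as follows. This is an illustration under assumptions: the function names are invented, the parser only looks for the `syscall=NN` token shown in the sample line, and the name table covers just a handful of syscalls using the `SYS_*` constants of the build architecture.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>

/* Extract NN from a log line containing "syscall=NN".
 * Returns -1 when the token is missing or has no digits. */
static long parse_blocked_syscall(const char *line)
{
    const char *p = strstr(line, "syscall=");
    if (p == NULL)
        return -1;
    p += strlen("syscall=");

    char *end;
    long nr = strtol(p, &end, 10);
    return end == p ? -1 : nr;
}

/* Map a number to a name for a few high-interest syscalls.
 * Valid only for the architecture this file was compiled for. */
static const char *syscall_name(long nr)
{
    static const struct { long nr; const char *name; } known[] = {
        { SYS_execve, "execve" },   { SYS_execveat, "execveat" },
        { SYS_socket, "socket" },   { SYS_connect, "connect" },
        { SYS_ptrace, "ptrace" },   { SYS_mount, "mount" },
        { SYS_getpid, "getpid" },   { SYS_getrandom, "getrandom" },
    };
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
        if (known[i].nr == nr)
            return known[i].name;
    return "unknown";
}
```

For anything beyond a quick lab check, a complete table (or the tools named above) is the better choice; the short list here only keeps the example readable.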

Symptom: seccomp_load fails with EACCES or EPERM (libseccomp reports errors as a negative return value, e.g. -EACCES).

Usually NO_NEW_PRIVS isn't set (only relevant for the raw prctl path — libseccomp sets it for you). If you dropped to a manual prctl(PR_SET_SECCOMP, ...), set PR_SET_NO_NEW_PRIVS first.
Check CONFIG_SECCOMP_FILTER is enabled: grep SECCOMP /boot/config-$(uname -r) or zgrep SECCOMP /proc/config.gz.

Symptom: works on your laptop, dies in CI/another distro.

Different glibc versions call different syscalls (newfstatat vs fstat, getrandom, rseq, clone3). Build the allow-list on the target image, not your dev box.

Questions to ask when it fails:

What is the exact syscall number in dmesg, and does the program genuinely need it?
Is the filter installed before the failing operation?
Am I filtering on a pointer argument (unsound) instead of a scalar?
Am I comparing my allow-list against a trace from the same environment?

Memory safety

Security & safety — detection and logging for seccomp.

seccomp is a detector as well as a preventer: a blocked syscall is a high-signal security event. Wire it into your monitoring.

What to log on a seccomp violation:

Timestamp (UTC, with timezone) and host identifier.
Process identity: PID, executable path, comm, and the responsible user/service account.
The syscall number/name that was blocked and the action taken (killed / errno).
Source context if known: the request/connection or job id that was being processed (a correlation id), so you can tie the violation to an input.
The security decision itself: "seccomp default-deny triggered."
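Where to find it: the kernel audit subsystem records these (type=SECCOMP audit records via auditd); dmesg shows them without auditd. KILL and SCMP_ACT_LOG actions are logged by default. ERRNO and TRAP denials are logged only if the filter was loaded with logging enabled (libseccomp's SCMP_FLTATR_CTL_LOG attribute, kernel 4.14+) and the action is listed in /proc/sys/kernel/seccomp/actions_logged.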
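The fields listed above can be assembled into a single structured line before shipping to a log pipeline. The sketch below is one possible shape, not a standard format: the struct, the `key=value` layout, and the field names (`corr_id`, `decision`) are all assumptions made for the example. Note that it records the syscall number and metadata only, never argument contents.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical record of one seccomp violation. */
struct seccomp_violation {
    time_t      when;            /* event time, seconds since the epoch */
    const char *host;            /* host identifier */
    pid_t       pid;
    const char *comm;            /* process name */
    long        syscall_nr;      /* blocked syscall number */
    const char *action;          /* e.g. "killed" or "errno" */
    const char *correlation_id;  /* joins to request logs */
};

/* Writes one key=value line into buf. Returns the line length,
 * or -1 if the time conversion fails or buf is too small. */
static int format_violation(char *buf, size_t len,
                            const struct seccomp_violation *v)
{
    struct tm utc;
    char ts[32];

    if (gmtime_r(&v->when, &utc) == NULL)
        return -1;
    if (strftime(ts, sizeof ts, "%Y-%m-%dT%H:%M:%SZ", &utc) == 0)
        return -1;

    int n = snprintf(buf, len,
                     "ts=%s host=%s pid=%ld comm=%s syscall=%ld action=%s "
                     "corr_id=%s decision=\"seccomp default-deny triggered\"",
                     ts, v->host, (long)v->pid, v->comm, v->syscall_nr,
                     v->action, v->correlation_id);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The timestamp is always rendered in UTC with a trailing `Z`, and truncation is treated as an error so a partial record is never emitted.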

What to NEVER log:

The contents of syscall arguments if they can hold secrets — passwords, tokens, session cookies, private keys, full PANs, or raw request bodies. Log the syscall and metadata, not the payload.
Unnecessary PII. Log a correlation id you can join to request logs under access control, not the personal data itself.

Which events signal abuse:

Repeated blocked execve/execveat, socket/connect, ptrace, mount, unshare, bpf, or init_module from a process that has no business making them — a strong indicator that code execution was achieved and the attacker is probing the sandbox.
A sudden SIGSYS on a process that ran clean for months (possible new exploit or a supply-chain change).

How false positives arise:

A dependency or libc upgrade starts using a new syscall (clone3, getrandom, rseq) → legitimate SIGSYS that looks like an attack. Rebuild the allow-list against the new image and diff it.
A rarely-taken code path (error handling, a debug feature) uses a syscall your trace never exercised. Trace all representative paths, not just the happy path.
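Diffing a fresh trace against the current allow-list is a simple set difference. The following sketch assumes both lists are available as arrays of syscall names (for example, copied from the syscall column of `strace -c`); the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears in list[0..n). */
static int name_in(const char *name, const char *const list[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, list[i]) == 0)
            return 1;
    return 0;
}

/* Collects every traced syscall that is absent from the allow-list.
 * Stores up to `cap` names in `missing` and returns the total number
 * found, which may exceed `cap`. */
static size_t missing_from_allowlist(const char *const traced[], size_t ntraced,
                                     const char *const allowed[], size_t nallowed,
                                     const char *missing[], size_t cap)
{
    size_t count = 0;
    for (size_t i = 0; i < ntraced; i++) {
        if (name_in(traced[i], allowed, nallowed))
            continue;
        if (count < cap)
            missing[count] = traced[i];
        count++;
    }
    return count;
}
```

A non-empty result after a dependency bump is a prompt for review, not an automatic addition: each missing syscall still has to be justified before it goes on the list.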

Memory-safety framing: seccomp does not replace bounds checking, -fstack-protector, ASan, or safe parsing. A buffer overflow still corrupts memory inside the process; seccomp only ensures the corrupted process can reach far fewer syscalls. Treat it as the last containment ring after the memory-safety work is done, not a substitute for it.

Real-world uses

Concrete authorized use case. A team runs an untrusted-code sandbox: users submit programs that are compiled and executed to grade an exercise (exactly the shape of a coding-judge platform). Each submission runs as a locked-down worker that installs a seccomp allow-list before executing user code, so a malicious submission cannot open a socket to exfiltrate data, execve a shell, or ptrace a sibling. The team builds the allow-list by tracing legitimate submissions, defaults to KILL_PROCESS, and ships blocked-syscall events to their SIEM with a per-submission correlation id. When a submission trips the filter, they can see which submission and which syscall — evidence, not guesswork.

Professional best-practice habits:

Validate + least privilege: the allow-list is least-privilege for capabilities. Pair it with dropped Linux capabilities, a read-only rootfs, and namespaces.
Secure defaults: default-deny (KILL_PROCESS), NO_NEW_PRIVS, filter installed before untrusted input.
Logging: record every violation with metadata + correlation id; never the payload.
Error handling: fail closed — if the filter can't load, refuse to run the workload.

Beginner vs advanced:

	Beginner	Advanced
Filter source	Handwritten small allow-list from `strace -c`	Generated per-service, diffed on every dependency bump in CI
Granularity	Syscall-level allow/deny	Argument-scalar filtering (fd/flags), `SECCOMP_RET_USER_NOTIF` to a broker
Enforcement	One process	Combined with namespaces, cgroups, capabilities, MAC (SELinux/AppArmor)
Ops	`dmesg` inspection	Audit records → SIEM alerts, dashboards, regression tests that assert denied syscalls stay denied

Ecosystem note. You rarely write raw BPF: Docker (--security-opt seccomp=profile.json), Kubernetes (securityContext.seccompProfile), and systemd (SystemCallFilter=) all consume declarative profiles built on the same allow-list idea. Reading those JSON/unit profiles is the auditor's day job.

Practice tasks

All tasks are lab-only: run on your own machine, a container, or a disposable VM. Never target a system you do not own or lack written authorization to test.

Authorization checklist (before any lab):

The machine is mine or explicitly authorized for this test.
It is isolated (localhost / container / throwaway VM), not production.
I have a way to reset it (snapshot, or rm the built binaries).

Beginner 1 — Minimal allow-list.

Objective: take a Hello-World C program and add a KILL_PROCESS default-deny seccomp filter that still lets it print.
Requirements: derive the syscall list from strace -c; program prints hello and exits 0.
Constraints: default action must be SCMP_ACT_KILL_PROCESS; no SCMP_ACT_ALLOW default.
Hints: remember exit_group and rt_sigreturn.
Concepts: seccomp_init, seccomp_rule_add, invisible syscalls.

Beginner 2 — Prove a denial.

Objective: to your Beginner-1 program, add a call to getpid() that you deliberately leave OFF the allow-list, and observe the kill.
Requirements: capture the exit status and the dmesg audit line naming the blocked syscall number.
Input/output: running the program should end in Bad system call (exit 159).
Constraints: do not add getpid to the list.
Defensive conclusion: explain in one line how you detected the violation (which log) and how you would remediate if the syscall were actually legitimate (add it after tracing).

Intermediate 1 — errno vs kill.

Objective: compare defaults: build two versions, one KILL_PROCESS, one ERRNO(EPERM), that both block socket.
Requirements: the errno version must print a clean "socket blocked (EPERM)" message and keep running; the kill version dies.
Output: a short table of observed behavior for each default.
Hints: check errno after the blocked call in the errno build.
Concepts: default-action trade-offs, fail-closed vs graceful degradation.

Intermediate 2 — Audit a profile.

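Objective: read Docker's default seccomp profile (the upstream default.json from the Moby project; the default is compiled into the daemon, so a local file such as /etc/docker/seccomp.json exists only if an administrator installed one) and list five syscalls it blocks and why each is dangerous (e.g. mount, init_module, bpf, keyctl). The profile is itself an allow-list, so "blocked" means absent from it or gated on a capability or kernel version; check ptrace in particular, which recent profiles permit on newer kernels.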
Requirements: for one blocked syscall, write two sentences of an audit finding: affected component, and the risk if it were allowed.
Constraints: analysis only — do not attempt any escape.
Concepts: allow/deny-list reading, severity reasoning (severity depends on exploitability + impact, not everything is critical).

Challenge — Sandbox a real parser and verify.

Objective: take a small program that parses untrusted input (e.g. reads a line and formats it) and box it with a tight allow-list installed before the read.
Requirements: (1) strace-derive the list on your target image; (2) install KILL_PROCESS default-deny; (3) add a MITIGATION VERIFICATION step: a script that feeds good input (must succeed) and an input that would trigger an execve/socket path (must be killed), asserting exit statuses; (4) log each violation with a correlation id.
Constraints: lab-only; filter must dominate all input-reading paths; no pointer-argument filters.
Defensive conclusion: produce a 3-line mini-report — what you hardened, how you verified the block works, and what you log to detect future violations — plus cleanup: delete the built binaries and, if you used a container/VM, revert the snapshot.
Concepts: everything above, end to end.

Summary

seccomp-BPF attaches an in-kernel BPF filter to a process's syscall path; for every syscall the kernel decides allow / kill / errno / trap / log. It is enforced by the kernel, so a compromised process cannot remove it.
Use a default-deny allow-list: seccomp_init(SCMP_ACT_KILL_PROCESS) then seccomp_rule_add(..., SCMP_ACT_ALLOW, SCMP_SYS(x), 0) for each needed syscall, then seccomp_load. Allow-lists fail closed; deny-lists fail open.
Build the list from a trace (strace -c), never from memory — the "invisible" syscalls (exit_group, rt_sigreturn, brk, mmap, getrandom) are the usual cause of surprise SIGSYS kills.
Install the filter before touching untrusted input, and fail closed if it can't load. Filter only on scalar args — pointer-argument filters are unsound (TOCTOU).
Verify every sandbox: allowed input must still work, a blocked syscall must produce SIGSYS/EPERM, and the kernel audit/dmesg line is your detection evidence. Log the syscall + metadata + correlation id; never the payload.
seccomp is defense-in-depth — it shrinks the blast radius of a compromise but does not fix the bug and does not make a system "completely secure." Test only on systems you own or are authorized to test, in an isolated lab, with cleanup.
Common mistakes to remember: guessing the list, loading too late, using a deny-list, trusting pointer filters, and overclaiming that seccomp alone equals security.