C Basics · beginner · ~15 min

State machines for parsers

Decompose stateful text parsing into explicit states + transitions.

Overview

State machines turn 'this code has too many nested ifs' into a clean table of states + transitions. Every text parser benefits.

Why it matters

Most real C parsers (HTTP, JSON, shell command line) are state machines. Recognising the pattern lets you implement them cleanly.

Core concepts

State. A small integer or enum describing what the parser is currently looking for.

Transition. On each input byte, decide the next state (often: 'stay in current', 'advance to next').

Action on transition. Emit a token, increment a counter, or save the run so far.

Pentester mindset. Parser divergence between two FSMs (e.g. proxy vs back-end) is request smuggling. Test your FSM against the most pathological inputs your peers can produce.

Defensive coding habit. Always have a default branch that rejects. 'Be liberal in what you accept' is a security anti-pattern.

Syntax notes

enum { OUT, IN_WORD, IN_QUOTE } state = OUT;
for (...; ...) {
    switch (state) {
    case OUT: ...; break;
    case IN_WORD: ...; break;
    case IN_QUOTE: ...; break;
    }
}

Lesson

A finite state machine (FSM) tracks 'what mode am I in?' as you walk input. Examples: word-counting (in-word vs not), command-line parser (out-of-token vs in-token vs in-quote), URL parser (scheme → authority → path → query).

Code examples

int in_word = 0;
for (const char *p = s; *p; p++) {
    if (isspace((unsigned char)*p)) in_word = 0;
    else if (!in_word) { in_word = 1; words++; }
}

Line by line

/* simple shell tokenizer: OUT -> IN_WORD on non-space; IN_WORD -> OUT on space */
enum { OUT, IN_WORD } st = OUT;
for (const char *p = line; *p; p++) {
    if (isspace((unsigned char)*p)) st = OUT;
    else if (st == OUT) { st = IN_WORD; tokens[ntok++] = p; }
}

Common mistakes

  • Mixing state variables in nested ifs until you can't read them. Make state explicit.

Debugging tips

Log every state transition (state X -> Y on byte Z) while debugging. The trace is usually self-explaining.

Memory safety

FSMs are pure logic — no memory hazards unless you forget to bound the output buffer for the tokens you emit.

Real-world uses

HTTP parsers, shell tokenizers, JSON / XML / TOML parsers, lexers in every language toolchain.

Practice tasks

  1. Word counter as an FSM. 2. Shell tokenizer with quote handling. 3. URL parser following RFC 3986 states.

Summary

Explicit state + explicit transition + a default-reject branch. That's the parser-correctness recipe.

Practice with these exercises