C Basics · beginner · ~15 min
Decompose stateful text parsing into explicit states + transitions.
State machines turn 'this code has too many nested ifs' into a clean table of states + transitions. Every text parser benefits.
Most real C parsers (HTTP, JSON, shell command line) are state machines. Recognising the pattern lets you implement them cleanly.
State. A small integer or enum describing what the parser is currently looking for.
Transition. On each input byte, decide the next state (often: 'stay in current', 'advance to next').
Action on transition. Emit a token, increment a counter, or save the run so far.
Pentester mindset. Parser divergence between two FSMs (e.g. proxy vs back-end) is request smuggling. Test your FSM against the most pathological inputs your peers can produce.
Defensive coding habit. Always have a default branch that rejects. 'Be liberal in what you accept' is a security anti-pattern.
enum { OUT, IN_WORD, IN_QUOTE } state = OUT;
for (...; ...) {
switch (state) {
case OUT: ...; break;
case IN_WORD: ...; break;
case IN_QUOTE: ...; break;
}
}
A finite state machine (FSM) tracks 'what mode am I in?' as you walk input. Examples: word-counting (in-word vs not), command-line parser (out-of-token vs in-token vs in-quote), URL parser (scheme → authority → path → query).
int in_word = 0;
for (const char *p = s; *p; p++) {
if (isspace((unsigned char)*p)) in_word = 0;
else if (!in_word) { in_word = 1; words++; }
}
/* simple shell tokenizer: OUT -> IN_WORD on non-space; IN_WORD -> OUT on space */
enum { OUT, IN_WORD } st = OUT;
for (const char *p = line; *p; p++) {
if (isspace((unsigned char)*p)) st = OUT;
else if (st == OUT) { st = IN_WORD; tokens[ntok++] = p; }
}
Log every state transition (state X -> Y on byte Z) while debugging. The trace is usually self-explaining.
FSMs are pure logic — no memory hazards unless you forget to bound the output buffer for the tokens you emit.
HTTP parsers, shell tokenizers, JSON / XML / TOML parsers, lexers in every language toolchain.
Explicit state + explicit transition + a default-reject branch. That's the parser-correctness recipe.