Arrays & Strings · beginner · ~12 min

Searching for a substring

Find one string inside another; understand the algorithmic options.

Overview

Substring search is the most-used string primitive. Most production code uses strstr (good enough for short patterns) or memmem (binary-safe).

Why it matters

Logs, headers, configs, payloads — all string content where you ask 'does this contain X'. Knowing the cost matters when scanning gigabytes.

Core concepts

Naive. Walk haystack, at each position compare against needle. O(n*m).

KMP / Boyer-Moore. Pre-process needle for O(n) average or sub-linear average. Code is fiddly; rarely needed in pen-test work.

Multi-pattern. Aho-Corasick is the algorithm; YARA and ClamAV use variants.

Pentester mindset. Anti-malware engines, IDS rules, and log scanners are all multi-pattern search at scale.

Defensive coding habit. Always cap needle length; never assume haystack is NUL-terminated when it came from a network read — use memmem with explicit length.

Syntax notes

const char *strstr(const char *haystack, const char *needle);
void *memmem(const void *h, size_t hn, const void *n, size_t nn);  /* GNU/BSD */

Lesson

C's standard strstr finds the first occurrence of needle in haystack with a naive O(n*m) scan. For larger inputs use Boyer-Moore or KMP (sub-linear average). For multi-pattern, use Aho-Corasick.

Code examples

const char *p = strstr(haystack, needle);
if (p) printf("found at offset %td\n", p - haystack);

Line by line

int contains(const char *line, const char *bad){
    return strstr(line, bad) != NULL;
}

Common mistakes

  • Treating an empty needle differently than the spec (strstr returns haystack for empty).

Debugging tips

Print the offset where the match was found and the surrounding ~20 chars; lets you sanity-check matches by eye.

Memory safety

strstr walks until NUL. For binary buffers use memmem or write a length-bounded variant.

Real-world uses

Log filters, malware signature scanners, web-server URL routers, intrusion detection.

Practice tasks

  1. Wrap strstr for case-insensitive search. 2. Implement count_occurrences. 3. Reject any line containing '../' (path-traversal allow-list check).

Summary

Use strstr for short patterns; memmem for binary; Aho-Corasick for many patterns. Always cap lengths.

Practice with these exercises