Safe Penetration Testing Labs · beginner · ~12 min

Count unique subdomains in a wordlist

What you will learn

Walk a newline-separated text buffer in C and split it into logical lines without ever reading past the end.
Normalize each entry (trim, lowercase) into a bounded local buffer so comparisons are consistent and an over-long line cannot overflow it.
Deduplicate entries with a small fixed-size table and a linear scan, understanding the O(n²) cost and why a hard cap keeps it safe.
Return clear status codes (count, `0` for `NULL`, `-1` for over-capacity) that an auditor can act on.
Treat an untrusted recon wordlist as hostile input: enforce size limits, avoid buffer overruns, and keep the tool offline (no DNS, no network).

Overview

Security objective. The asset you are protecting here is your own tooling and the machine it runs on. The threat is an untrusted input file — a subdomain wordlist that may be huge, contain absurdly long lines, embed control bytes, or lack a trailing newline. A careless parser that trusts this file can overrun a buffer, hang on O(n²) work, or crash. In this lesson you build a parser that safely reads such a file and returns the count of unique, non-empty entries — while enforcing bounds so malformed input cannot corrupt memory.

A subdomain wordlist is just plain text: one entry per line, separated by the newline byte \n. Tools in a passive-recon pipeline (Amass, Subfinder, dnsx and similar) emit and consume these files. Before you fan out and query anything, you deduplicate the list so you do not waste work on repeats.

This builds directly on your prerequisites. From C strings you use NUL-terminated char arrays, strlen, strcmp, and the idea that a buffer has a fixed capacity you must respect. From loops (for / while / do-while) you use a scan loop over the input and an inner search loop over the table of entries seen so far. The new idea is combining them safely: an outer pass over lines, an inner pass over the dedup table, and a cap that bounds the total work.

Crucially, this tool is offline. It never calls getaddrinfo, never opens a socket, never touches DNS. It transforms text to a number. That makes it safe to run and study on your own machine.

Why it matters

Deduplication is a common early step in passive-recon pipelines. Recon (reconnaissance) is the information-gathering phase that precedes an authorized security test. Wordlists get concatenated from many sources, so duplicates are the norm, not the exception. Feeding duplicates downstream wastes time and can trip rate limits or alarms on the systems you are authorized to test.

Writing the deduplicator in C teaches a discipline that matters far beyond this exercise: handling untrusted input under a hard bound. Real tools crash or get exploited because they trust an input file's size or line length. A linear-scan dedup is O(n²): the work grows with the square of the number of distinct entries, so a 10-million-line file of mostly distinct entries means on the order of 10^13 string comparisons and could hang your tool for hours or longer. Capping the entry count converts an unbounded risk into a predictable, testable failure mode.

In professional work you meet this pattern constantly: parsing scan output, log lines, CVE feeds, and configuration files that come from outside your trust boundary. The habits you practice here — bound the size, normalize before comparing, never write past a buffer, return an explicit status — are exactly the habits that separate a robust internal tool from one that becomes its own vulnerability.

Core concepts

1. Untrusted input and the trust boundary

Definition. A trust boundary is the line between data you control and data you do not. The wordlist file crosses that boundary: it may come from a teammate, a public repo, or the output of another tool.

Why it matters. Everything on the untrusted side must be validated before use. You cannot assume lines are short, that the file ends in \n, or that bytes are printable ASCII.

How it works. You treat the incoming const char *list as read-only and hostile. You copy each line into a bounded local buffer with a known capacity, truncating rather than overflowing.

When / when not. Always validate at the boundary. You do not need to re-validate data you generated yourself inside the program.

Pitfall. Assuming the last line ends with \n. Many files (and many editors) omit the final newline, so the last entry can be silently dropped if you only count \n characters.
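The pitfall can be seen in a minimal sketch. The function names `count_newlines` and `count_segments` are hypothetical, introduced only for illustration; neither trims whitespace nor skips blank lines, so this is not the lesson's solution.

```c
#include <stddef.h>

/* Naive: counts '\n' bytes only, so a final line with no newline is missed. */
size_t count_newlines(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (*s == '\n') n++;
    return n;
}

/* Terminator-aware: a segment ends at '\n' or at the final NUL.
 * An empty tail (input ends in '\n', or input is "") is not a segment. */
size_t count_segments(const char *s) {
    size_t n = 0;
    const char *start = s;
    for (;; s++) {
        if (*s == '\n' || *s == '\0') {
            if (*s == '\n' || s > start) n++;  /* count unless empty tail */
            if (*s == '\0') break;
            start = s + 1;
        }
    }
    return n;
}
```

For the input "www\napi" the naive version reports 1 while the terminator-aware version reports 2; for "www\napi\n" both report 2.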

2. Line splitting without overrun

Definition. Walking the buffer from start to NUL, treating each \n (and the final NUL) as a line terminator.

Plain explanation. Keep a pointer to the start of the current line. Advance until you hit \n or \0. The characters in between are one line.

How it works. The loop condition checks the current byte before dereferencing the next. The NUL terminator is your stop signal — you never read index strlen(list)+1.

When / when not. Use this for any \n-delimited in-memory buffer. For streaming files use fgets with a fixed buffer instead.

Pitfall. Off-by-one: reading one byte past the buffer when the line is exactly buffer-sized, or losing the last line because the loop exits on NUL before emitting it.
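For the streaming case, one possible shape is sketched below. The helper name `count_stream_lines` and the buffer size `LINE_BUF` are assumptions made for this illustration; the sketch only counts physical lines and tallies over-long ones, and it is not part of the lesson's required solution.

```c
#include <stdio.h>
#include <string.h>

#define LINE_BUF 256  /* hypothetical capacity, including the NUL */

/* Counts lines read from fp. A line that does not fit in one fgets chunk
 * arrives in several pieces; it is counted once and tallied in *too_long.
 * A chunk that fills the buffer without a newline is treated as over-long,
 * so a final unterminated line of exactly LINE_BUF-1 bytes is flagged too. */
long count_stream_lines(FILE *fp, long *too_long) {
    char buf[LINE_BUF];
    long lines = 0;
    int continuing = 0;            /* 1 while inside an over-long line */
    *too_long = 0;
    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);
        int complete = (len > 0 && buf[len - 1] == '\n');
        if (!continuing) {
            lines++;               /* first chunk of a new line */
            if (!complete && len == sizeof buf - 1) (*too_long)++;
        }
        continuing = !complete;
    }
    return lines;
}
```

Because fgets never writes more than the size it is given, the buffer cannot overflow no matter how long a line in the file is; the cost is that the caller has to notice when a line was split.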

3. Normalization before comparison

Definition. Converting each entry to a canonical form (trim surrounding whitespace, lowercase every letter) so that WWW, Www, and www padded with spaces all compare as equal.

Plain explanation. DNS names are case-insensitive, so a correct dedup must ignore case. Normalizing into a fixed local buffer also caps line length safely.

How it works. Copy up to CAP-1 bytes into a local array, applying tolower((unsigned char)c) to each, then NUL-terminate. Longer lines are truncated, not overflowed.

When / when not. Normalize when the comparison must be semantic (case, whitespace). Do not normalize away meaningful bytes when the exact form matters (for example, comparing password hashes).

Pitfall. Passing a plain char to tolower is undefined behavior when the value is negative (and not EOF), which happens for bytes ≥ 128 on platforms where char is signed. Always cast to unsigned char first.
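A minimal sketch of the cast in isolation may help. The helper name `lower_byte` is hypothetical and exists only to show where the conversion goes.

```c
#include <ctype.h>

/* Lowercase one byte safely: convert to unsigned char first so the value
 * passed to tolower is always in the range 0..UCHAR_MAX. */
char lower_byte(char c) {
    return (char)tolower((unsigned char)c);
}
```

In the default "C" locale this maps 'W' to 'w' and leaves digits, hyphens, and bytes ≥ 128 unchanged.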

4. Bounded deduplication and the O(n²) cap

Definition. Keeping a table of entries already seen and, for each new entry, scanning the table to decide whether it is new.

Plain explanation. With N stored entries, each new line does up to N comparisons — that is O(n²) overall. A cap (256 here) bounds both memory and time.

How it works. Store each unique normalized entry in seen[MAX_ENTRIES]. On a new line, strcmp it against every stored entry; if none matches, store it and increment the count. If the table is full and the entry is new, return -1.

When / when not. Linear-scan dedup is fine for small, bounded lists. For large lists use a hash set — but that is a later lesson; here the cap is the safety mechanism.

Pitfall. Forgetting the cap turns a malicious 10-million-line file into a denial-of-service against your own tool.
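The arithmetic behind the cap can be written out as a small sketch. The function name `worst_case_compares` is hypothetical; it models the worst case in which every entry is distinct.

```c
/* Worst case for a linear-scan dedup: every entry is distinct, so entry k
 * (0-based) is compared against the k entries already stored.
 * Total = 0 + 1 + ... + (n-1) = n(n-1)/2 strcmp calls. */
unsigned long long worst_case_compares(unsigned long long n) {
    return n == 0 ? 0 : n * (n - 1) / 2;
}
```

With the cap at 256 the worst case is 32,640 comparisons. For 10 million distinct entries the same formula gives 49,999,995,000,000, which is why an uncapped linear scan is not viable at that size.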

Threat model

                TRUST BOUNDARY
                      |
  [ untrusted ]       |        [ trusted: your program ]
                      |
  wordlist file  -->  | -->  parser (this lesson)  -->  unique count (int)
  (const char *)      |       - bounded line buffer
   * huge size        |       - lowercase + trim
   * long lines       |       - dedup table[256]
   * control bytes    |       - cap => return -1
   * no final \n      |
                      |     ENTRY POINT: count_unique_domains(list)
                      |     ASSET: tool memory + CPU time (availability)
                      |     No network. No DNS. No file writes.

Knowledge check.

What asset does the size cap protect, and against which threat? (Answer: your tool's CPU/memory availability, against an oversized or malicious input file.)
Where is the trust boundary in this program? (Answer: at the const char *list parameter — everything it points to is untrusted.)
What insecure assumption would cause the last entry to be dropped? (Answer: assuming every line, including the last, is terminated by \n.)

Syntax notes

The key structure is a two-level loop plus a bounded copy. Here is the shape, annotated and lab-safe:

#include <ctype.h>   /* tolower            */
#include <string.h>  /* strcmp, strlen     */

#define MAX_ENTRIES 256   /* hard cap on distinct entries */
#define ENTRY_CAP   256   /* max bytes per normalized entry (incl. NUL) */

/* Copy [start, end) into buf, trimmed + lowercased, NUL-terminated.
 * Returns the normalized length (0 means the line was blank). */
static size_t normalize(const char *start, const char *end, char *buf) {
    /* trim leading spaces */
    while (start < end && isspace((unsigned char)*start)) start++;
    /* trim trailing spaces */
    while (end > start && isspace((unsigned char)end[-1])) end--;

    size_t n = 0;
    for (const char *p = start; p < end && n < ENTRY_CAP - 1; p++)
        buf[n++] = (char)tolower((unsigned char)*p);  /* bounded copy */
    buf[n] = '\0';                                     /* always terminate */
    return n;
}

Key points:

isspace/tolower receive (unsigned char) casts to avoid undefined behavior on bytes ≥ 128.
The copy loop stops at ENTRY_CAP - 1, so it can never overflow buf.
buf[n] = '\0' guarantees a valid C string even for a truncated line.
[start, end) is a half-open range: start is inclusive, end is exclusive — the standard C idiom for a slice.
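To see trimming, lowercasing, and truncation side by side, here is a sketch of a variant that takes the capacity as a parameter. The name `normalize_n` and the `cap` parameter are assumptions for this illustration only; the lesson's normalize uses the fixed ENTRY_CAP.

```c
#include <ctype.h>
#include <stddef.h>

/* Same logic as normalize above, but the capacity is a parameter
 * (hypothetical variant) so truncation is easy to demonstrate.
 * cap must be at least 1 so there is room for the NUL. */
size_t normalize_n(const char *start, const char *end, char *buf, size_t cap) {
    while (start < end && isspace((unsigned char)*start)) start++;
    while (end > start && isspace((unsigned char)end[-1])) end--;
    size_t n = 0;
    for (const char *p = start; p < end && n < cap - 1; p++)
        buf[n++] = (char)tolower((unsigned char)*p);  /* bounded copy */
    buf[n] = '\0';                                     /* always terminate */
    return n;
}
```

Given the input "  WWW.Example.COM \r" and a 32-byte buffer, the result is "www.example.com" with length 15. With a 4-byte buffer the result is "www" with length 3. One consequence of truncating: two different over-long lines that share their first cap-1 bytes normalize to the same string and are treated as duplicates.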

Lesson

Why this matters

Subdomain wordlists from tools like Amass, Subfinder, or ProjectDiscovery's dnsx pipelines are just plain text, one entry per line, separated by \n.

Before you fan out and query those entries, you remove the duplicates. In this lesson you write that deduplicator in C. It runs entirely on a static buffer and performs no DNS lookups.

What the file looks like

www
api
mail
www       <- duplicate
admin
          <- empty line, ignored
api       <- duplicate

Your job

Implement the function:

int count_unique_domains(const char *list);

Requirements:

Return the number of unique, non-empty lines.
Ignore case when comparing (WWW and www are the same entry).
Return 0 for NULL input.

Size limit

This exercise is bounded. Assume the wordlist holds at most 256 entries.

If more than 256 distinct entries are present, return -1. This signals to the auditor that the cap was exceeded. Exactly 256 distinct entries is still a valid result of 256.
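A caller has to branch on these return values. The sketch below shows one way to do that; the helper name `describe_result` and its labels are hypothetical and not part of the exercise.

```c
/* Map the return value of count_unique_domains to a short label a caller
 * could log or branch on. The labels are hypothetical. */
const char *describe_result(int rc) {
    if (rc < 0)  return "capped";   /* more than 256 distinct entries */
    if (rc == 0) return "empty";    /* NULL input or no non-empty lines */
    return "ok";                    /* rc is the unique-entry count */
}
```

Note that 0 covers both NULL input and a list with no non-empty lines; a caller that must tell them apart needs to check for NULL before calling.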

Common mistakes

Treating case-different lines as distinct (WWW vs www).
Counting empty lines.
Forgetting that the last line may not end in \n.

What this is NOT

A live DNS resolver. We never call getaddrinfo.
A wildcard or regex matcher.

Code examples

The example follows an insecure → secure → verify shape. The first version is a common beginner attempt that overflows; the second is the safe fix; the third proves the fix with checks.

1. Insecure version

/* WARNING: intentionally vulnerable — use only in a local, isolated,
   authorized lab. Do not deploy. */
#include <string.h>
#include <ctype.h>

/* BUG 1: fixed 32-byte line buffer, no bounds check on the copy.
   BUG 2: no cap on the number of stored entries.
   BUG 3: a blank line is stored and counted as an entry; nothing is trimmed.
   BUG 4: no NULL check before dereferencing list. */
int count_unique_bad(const char *list) {
    char seen[1000][32];
    int count = 0;
    const char *p = list;
    while (*p) {
        char line[32];
        int i = 0;
        while (*p && *p != '\n')
            line[i++] = tolower((unsigned char)*p++);  /* overflow if line > 31 */
        line[i] = '\0';
        if (*p == '\n') p++;
        int dup = 0;
        for (int k = 0; k < count; k++)
            if (strcmp(seen[k], line) == 0) dup = 1;
        if (!dup) strcpy(seen[count++], line);         /* overflow past 1000 */
    }
    return count;
}

A single line longer than 31 characters writes past line; more than 1000 unique entries writes past seen. Both are memory-corruption bugs an attacker could trigger with a crafted wordlist.

2. Secure version

#include <ctype.h>
#include <string.h>
#include <stddef.h>

#define MAX_ENTRIES 256   /* hard cap on distinct entries */
#define ENTRY_CAP   256   /* bytes per normalized entry, incl. NUL */

static size_t normalize(const char *start, const char *end, char *buf) {
    while (start < end && isspace((unsigned char)*start)) start++;
    while (end > start && isspace((unsigned char)end[-1])) end--;
    size_t n = 0;
    for (const char *p = start; p < end && n < ENTRY_CAP - 1; p++)
        buf[n++] = (char)tolower((unsigned char)*p);
    buf[n] = '\0';
    return n;
}

/* Returns unique non-empty entry count, 0 for NULL, -1 if > MAX_ENTRIES. */
int count_unique_domains(const char *list) {
    if (list == NULL) return 0;

    char seen[MAX_ENTRIES][ENTRY_CAP];
    int count = 0;

    const char *line_start = list;
    for (const char *p = list; ; p++) {
        if (*p == '\n' || *p == '\0') {
            char norm[ENTRY_CAP];
            size_t len = normalize(line_start, p, norm);
            if (len > 0) {                       /* skip empty lines */
                int dup = 0;
                for (int k = 0; k < count; k++) {
                    if (strcmp(seen[k], norm) == 0) { dup = 1; break; }
                }
                if (!dup) {
                    if (count >= MAX_ENTRIES) return -1;  /* cap reached */
                    memcpy(seen[count], norm, len + 1);
                    count++;
                }
            }
            if (*p == '\0') break;               /* handles missing final \n */
            line_start = p + 1;
        }
    }
    return count;
}

3. Verify (the fix rejects bad input and accepts good input)

#include <stdio.h>
#include <string.h>
#include <assert.h>

int count_unique_domains(const char *list);

int main(void) {
    /* ACCEPTS good input: dedup + case-insensitive + no final newline */
    assert(count_unique_domains("www\napi\nmail\nwww\nadmin\n\napi") == 4);
    assert(count_unique_domains("WWW\nwww\nWww") == 1);   /* case folded */
    assert(count_unique_domains("") == 0);
    assert(count_unique_domains(NULL) == 0);              /* NULL safe */
    assert(count_unique_domains("\n\n\n") == 0);          /* only blanks */

    /* REJECTS oversized input: 257 distinct entries -> -1, no overflow */
    char big[257 * 8];
    size_t off = 0;
    for (int i = 0; i < 257; i++)
        off += (size_t)snprintf(big + off, sizeof big - off, "h%d\n", i);
    assert(count_unique_domains(big) == -1);

    /* TRUNCATES an over-long line: counted once, never overflows */
    char longline[600];
    memset(longline, 'a', sizeof longline - 1);
    longline[sizeof longline - 1] = '\0';
    assert(count_unique_domains(longline) == 1);          /* still one entry */

    printf("all checks passed\n");
    return 0;
}

Expected output:

all checks passed

Compile and run with sanitizers to check for memory errors on these inputs:

cc -std=c11 -Wall -Wextra -fsanitize=address,undefined dedup.c -o dedup
./dedup

If ASan/UBSan stay silent and all checks passed prints, the secure version handled each of these hostile cases without corrupting memory.

Line-by-line walkthrough

Walking the secure count_unique_domains:

if (list == NULL) return 0; — validate at the trust boundary before touching the pointer. NULL is a legitimate "no data" answer, not a crash.
char seen[MAX_ENTRIES][ENTRY_CAP]; — a fixed 256×256 table on the stack. Its size is known at compile time, so there is no unbounded allocation.
const char *line_start = list; — marks the beginning of the current line.
for (const char *p = list; ; p++) — scan every byte. The loop has no condition; it exits from inside when it processes the NUL.
if (*p == '\n' || *p == '\0') — both a newline and the terminating NUL end a line. Including \0 is what makes a missing final newline safe.
normalize(line_start, p, norm) — copy [line_start, p) into norm, trimmed and lowercased, bounded to ENTRY_CAP-1 bytes.
if (len > 0) — a length of 0 means the line was empty or all whitespace; skip it so blanks are never counted.
Inner loop for (k ...) strcmp — compare against each stored entry; break on the first match. This is the O(n²) step, bounded by the cap.
if (count >= MAX_ENTRIES) return -1; — the cap check happens before writing, so the table can never overflow.
memcpy(seen[count], norm, len + 1); — copy the string plus its NUL. len came from a bounded copy, so this is safe.
if (*p == '\0') break; — stop after processing the final line.
line_start = p + 1; — next line starts just after the newline.

Trace for input `"WWW\nwww\napi"`

Step	Line seen	norm	In table?	count after
1	`WWW`	`www`	no	1
2	`www`	`www`	yes (dup)	1
3	`api` (no final \n)	`api`	no	2

Result: 2. The duplicate collapses by case folding, and the last line is counted even without a trailing newline.

Common mistakes

Wrong approach	Why it is wrong	Corrected	How to recognize / prevent
Copy the line with an unbounded `while (*p != '\n')` into a fixed buffer	Overflows the buffer on a long line — memory corruption	Bound the copy to `ENTRY_CAP-1` and truncate	Build with `-fsanitize=address`; a long-line test aborts if unbounded
Count `\n` characters to count lines	Drops the last line when the file has no trailing newline	Treat both `\n` and `\0` as terminators	Test an input that does not end in `\n`
`strcmp` raw lines without lowercasing	`WWW` and `www` count as two distinct hosts — wrong for DNS	Normalize (lowercase + trim) before comparing	Add a case-fold assertion to the test suite
No cap on stored entries	A huge file overflows the table or hangs on O(n²) work (self-DoS)	Return `-1` when a new entry would exceed `MAX_ENTRIES`	Feed 257 distinct entries and expect `-1`
`tolower(*p)` with a plain `char`	Undefined behavior when the byte is ≥ 128 and `char` is signed (negative value)	`tolower((unsigned char)*p)`	Sanitizers may not report it, so check for the cast in code review and make it a habit
Counting empty or whitespace-only lines	Inflates the count and wastes downstream queries	Skip when normalized length is 0	Test input of only blank lines expects `0`

Debugging tips

Wrong count by one. Usually the last-line case. Print each normalized entry with fprintf(stderr, "[%s]\n", norm) and confirm the final entry appears. Check that your terminator test includes \0, not just \n.
Crash or ASan report. Run cc -std=c11 -fsanitize=address,undefined .... AddressSanitizer names the exact overflowing write. If it points at the copy loop, your line buffer is unbounded; if at seen[count], your cap check is missing or runs after the write.
Case-sensitivity bug. Duplicates that differ only in case slip through. Log the normalized form, not the raw line, and verify it is fully lowercased.
Hang on large input. If the tool freezes, you have no cap. Confirm if (count >= MAX_ENTRIES) return -1; runs before storing.
Questions to ask when it fails: Did I validate NULL first? Is every buffer write bounded by a known capacity? Does my line loop terminate on \0 as well as \n? Am I comparing normalized strings on both sides? Does a truncated long line still produce a valid NUL-terminated string?

Memory safety and detection

Memory / undefined-behavior safety (C)

Every buffer write is bounded. The normalize copy stops at ENTRY_CAP - 1; the table write is gated by count >= MAX_ENTRIES. No path writes past a buffer.
Always NUL-terminate after a bounded copy (buf[n] = '\0') so later strcmp/strlen never run off the end.
Cast to unsigned char before tolower/isspace; passing a negative char is undefined behavior.
memcpy(..., len + 1) copies exactly the string plus its terminator — len came from a bounded source, so the size is trustworthy.
Half-open ranges ([start, end)) avoid off-by-one reads at the line boundary.
Build every time with -Wall -Wextra -fsanitize=address,undefined while developing.

Security and safety: detection & logging

Even an offline parser deserves an audit trail when it runs inside a recon pipeline. Log: a timestamp, the source file path or identifier, the input size in bytes, the entry count returned, and the security decision — for example result=capped when the function returns -1 or result=truncated_line when a line exceeded ENTRY_CAP. Add a run/correlation id so the parse can be tied back to the wider engagement.

Never log: the raw wordlist contents beyond what is needed (recon lists can contain sensitive internal hostnames), and never any secrets, tokens, or credentials that a mislabeled input file might contain. Log counts and decisions, not full payloads.

Events that signal abuse or trouble: repeated -1 (cap-reached) results suggest someone is feeding oversized or generated lists; a spike in truncated_line events suggests malformed or hostile input. False positives arise legitimately — a large but valid merged wordlist can hit the cap without any attack — so treat these as signals to investigate and to consider a larger-capacity build, not as proof of an attacker.
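The logging guidance above could be sketched as a single formatting helper. Everything here is an assumption for illustration: the function name `format_audit_line`, the field names, and the key=value layout. The timestamp is left to the caller, and `run_id` and `source_id` are assumed to be trusted identifiers that contain no spaces or newlines.

```c
#include <stdio.h>
#include <stddef.h>

/* Format one audit line into out. Returns 0 on success, -1 if the line did
 * not fit in out_size bytes or snprintf reported an error.
 * Field names and layout are hypothetical. Only counts and decisions are
 * logged, never wordlist contents. */
int format_audit_line(char *out, size_t out_size, const char *run_id,
                      const char *source_id, size_t input_bytes,
                      int result, int truncated_lines) {
    const char *decision = result < 0 ? "capped"
                         : truncated_lines > 0 ? "truncated_line"
                         : "ok";
    int n = snprintf(out, out_size,
                     "run=%s source=%s bytes=%zu count=%d result=%s",
                     run_id, source_id, input_bytes, result, decision);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A call with run id "r1", source "lists/merged.txt", 1234 input bytes, and a return value of -1 produces `run=r1 source=lists/merged.txt bytes=1234 count=-1 result=capped`.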

Why only in an authorized lab. This parser is harmless on its own, but the recon pipeline it feeds performs enumeration against targets. Run and study it against systems you own or are explicitly authorized to test.

Real-world uses

Authorized use case. During an authorized external assessment, a tester merges subdomain wordlists from several open-source tools into one file, then deduplicates before resolution to avoid redundant DNS queries against the client's authorized scope. A small, audited C deduper like this one is a dependable pre-processing step: no network, no surprises, predictable memory.

Professional best-practice habits

Habit	What it looks like here
Input validation	NULL check, size cap, per-line length bound
Least privilege	Runs with no network access and no write access to the target files
Secure defaults	Case-insensitive compare, blank lines ignored, explicit `-1` on overflow
Logging	Record source, byte size, result count, and any cap/truncation decision
Error handling	Distinct return codes (`count`, `0`, `-1`) an auditor can branch on

Beginner focus: get the bounds and the last-line case right, and prove them with a sanitizer build and a small test suite.

Advanced focus: replace the O(n²) linear scan with a hash set for large lists, stream from disk with fgets instead of a preloaded buffer, add Unicode/IDNA-aware normalization, and emit structured (JSON) audit logs for the pipeline. Always keep the size cap or an equivalent resource guard.
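As a rough idea of what the hash-set replacement could look like, here is a fixed-capacity sketch using open addressing with linear probing. All names (`str_set`, `set_init`, `set_add`, `SET_SLOTS`, `SET_MAX`, `KEY_CAP`) are hypothetical. FNV-1a is used only because it is short; it is not resistant to deliberately colliding input, so the entry cap stays in place as the resource guard.

```c
#include <stdint.h>
#include <string.h>

#define SET_SLOTS 512   /* hypothetical: power of two, larger than SET_MAX */
#define SET_MAX   256   /* keeps the lesson's cap on distinct entries */
#define KEY_CAP   256   /* bytes per key, including the NUL */

struct str_set {
    char keys[SET_SLOTS][KEY_CAP];
    unsigned char used[SET_SLOTS];
    int count;
};

/* FNV-1a, 32-bit variant. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

void set_init(struct str_set *set) {
    memset(set->used, 0, sizeof set->used);
    set->count = 0;
}

/* Insert an already-normalized key.
 * Returns 1 if newly added, 0 if already present,
 * -1 if the cap would be exceeded or the key is too long. */
int set_add(struct str_set *set, const char *key) {
    size_t len = strlen(key);
    if (len >= KEY_CAP) return -1;
    uint32_t i = fnv1a(key) & (SET_SLOTS - 1);
    /* At most SET_MAX of SET_SLOTS slots are used, so an empty slot exists
     * and this probe loop always terminates. */
    while (set->used[i]) {
        if (strcmp(set->keys[i], key) == 0) return 0;
        i = (i + 1) & (SET_SLOTS - 1);
    }
    if (set->count >= SET_MAX) return -1;   /* cap check before the write */
    memcpy(set->keys[i], key, len + 1);
    set->used[i] = 1;
    set->count++;
    return 1;
}
```

The structure is 128 KiB plus bookkeeping, so it is better declared static or allocated by the caller than placed on a small stack. Adding "www", "api", and "www" again returns 1, 1, and 0 and leaves count at 2.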

Practice tasks

Beginner 1 — Count non-empty lines

Objective: Implement int count_lines(const char *list) returning the number of non-empty lines.
Requirements: Skip empty and whitespace-only lines; handle a missing final newline; return 0 for NULL.
Input / output: "a\n\nb\nc" → 3.
Constraints: No dynamic allocation; single pass.
Hints: Reuse the \n-or-\0 terminator idea; trim before deciding if the line is empty.
Concepts: line splitting, trust-boundary NULL check.

Beginner 2 — Lowercase normalizer

Objective: Implement size_t normalize(const char *start, const char *end, char *buf) as in the lesson.
Requirements: Trim leading/trailing whitespace, lowercase, bound the copy to ENTRY_CAP-1, always NUL-terminate.
Input / output: " WWW " → www (length 3).
Constraints: Cast to unsigned char before tolower/isspace.
Hints: Use a half-open range; stop the copy at the capacity limit.
Concepts: bounded copy, normalization.

Intermediate 1 — Count by suffix

Objective: Implement int count_of_domain(const char *list, const char *domain) counting lines whose normalized form ends with domain.
Requirements: Case-insensitive; skip empty lines; domain is trusted but may be longer than a line.
Input / output: list of hosts, domain = ".example.com" → count ending with it.
Constraints: No allocation; compare only after normalizing.
Hints: Compare the tail using strlen offsets; guard against a suffix longer than the line.
Concepts: normalization, suffix comparison, bounds checks.
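The suffix hint can be sketched as a small helper. The name `ends_with` is hypothetical, and the sketch assumes both strings are NUL-terminated and already normalized; wiring it into the counting loop is left as the exercise.

```c
#include <string.h>

/* Returns 1 if s ends with suffix, 0 otherwise. Both strings are assumed to
 * be NUL-terminated and already normalized (trimmed, lowercased). */
int ends_with(const char *s, const char *suffix) {
    size_t slen = strlen(s);
    size_t xlen = strlen(suffix);
    if (xlen > slen) return 0;               /* guard: suffix longer than line */
    return memcmp(s + (slen - xlen), suffix, xlen) == 0;
}
```

The length guard runs before the pointer arithmetic, so `s + (slen - xlen)` never points before the start of the line.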

Intermediate 2 — Capacity-aware dedup with reporting

Objective: Extend count_unique_domains to also write, via an out-parameter int *truncated_lines, how many lines were truncated at ENTRY_CAP.
Requirements: Keep the -1 over-capacity behavior; do not change the return contract otherwise.
Input / output: returns unique count and sets *truncated_lines.
Constraints: Still no unbounded writes; sanitizer-clean.
Hints: normalize can return whether it hit the cap; increment the counter there.
Concepts: detection/logging signal, out-parameters, bounded copy.

Challenge — Lab-only hardening exercise

Objective: In a local, isolated, authorized lab, compile the insecure count_unique_bad version, reproduce the overflow with a crafted long-line and over-count input under AddressSanitizer, then remediate.
Requirements: Document the crash (which buffer, which write), apply the bounded/capped fix, and verify the fix rejects the same malicious inputs while still accepting a normal wordlist. Add an audit log line recording size, result, and any cap/truncation decision.
Constraints: Everything runs on localhost only; use synthetic inputs you generate; no real hostnames.
Authorization checklist: (1) It is your own machine or an explicitly authorized lab VM. (2) Inputs are synthetic, not real client data. (3) No network calls are made. (4) You have permission to run these tools here.
Cleanup / reset: Delete the compiled binaries and any generated wordlist files; clear the scratch directory.
Defensive conclusion: The deliverable is the fixed function plus evidence (ASan-clean run, passing tests) that the vulnerability is remediated — not the exploit.
Concepts: overflow reproduction, bounded remediation, mitigation verification, logging.

Summary

Main concepts. Treat the wordlist as untrusted input crossing a trust boundary; split on \n (and \0 for the last line); normalize (trim + lowercase) into a bounded buffer; deduplicate with a fixed 256-slot table; return -1 when more than 256 distinct entries are present.
Key syntax / commands. tolower((unsigned char)c), half-open [start, end) ranges, memcpy(dst, src, len + 1), and building with cc -std=c11 -Wall -Wextra -fsanitize=address,undefined.
Common mistakes. Unbounded line copies, counting only \n (dropping the last line), case-sensitive comparison, no size cap, tolower on a plain char, and counting blank lines.
What to remember. Bound every write, validate at the boundary, normalize before comparing, cap the work, and return explicit status codes. A parser that trusts its input is itself a vulnerability; a bounded one is a dependable tool. This stays offline — no DNS, no sockets — and only feeds pipelines you are authorized to run.
