Safe Penetration Testing Labs · intermediate · ~15 min

Score a URL for phishing markers

## What you will learn

- Extract the **host** (authority) portion of a URL string in C without a heavy parsing library.
- Implement a small, deterministic, rule-based **phishing score** that adds one point per suspicious structural trait.
- Reason about why each heuristic (an `@` in the authority, IP-as-host, punycode `xn--`, brand-plus-dash, long hosts, digit runs) is a phishing signal, and why each is weak on its own.
- Write the scan as a **single, bounds-checked pass** over the host so there are no out-of-range reads or off-by-one bugs.
- Handle the defensive edge cases: `NULL` input, missing scheme, a host with no trailing `/`, and the difference between **host** and **path**.
- Understand where this fits in a layered detector and what it deliberately does *not* do (no network fetch, no blocklist lookup).

Overview

A phishing URL is a web address crafted to look like a trusted brand so a victim clicks it and hands over a password, a card number, or a one-time code. Before any expensive check runs, defenders look at the shape of the URL itself — its structure — because structure is cheap to inspect and attackers reuse the same tricks over and over.

This lesson builds a function that reads a URL string and returns an integer score: the higher the score, the more structural "smells" the URL has. We never connect to the URL. We only look at the text.

The word heuristic means a quick rule of thumb. A heuristic is not a proof — a legitimate site can trip a rule, and a clever attacker can avoid one. But a bundle of cheap heuristics filters out the obvious cases so the slow, accurate layers (machine learning, live-page fetch) only run on the survivors.

This builds directly on two earlier lessons. From C strings you already know that a C string is a char array terminated by \0, that strlen, strchr, and strstr walk that array, and that reading past the terminator is undefined behaviour. From for / while / do-while you know how to loop over characters with an index and stop at the right boundary. Here we combine both: we find the host with strstr/strchr, then loop across only the host bytes to count digits and dots.

Key terms, in the order they appear:

Scheme — the part before ://, e.g. https.
Authority / host — what sits between // and the next /, ?, or end of string, e.g. login.paypal.com.
Path — everything after the host, e.g. /account/verify.
Userinfo — an optional user@ placed inside the authority; the classic spoofing trick.
Punycode (xn--) — an ASCII encoding of a non-ASCII (internationalized) domain label.

Why it matters

Phishing is the single most common entry point for account takeover and ransomware. The defender's problem is volume: a mail gateway or proxy may see millions of URLs an hour and cannot fetch each one.

Real detectors are built in layers, cheapest first:

        incoming URL
             |
   +---------v----------+   microseconds, no network
   | Layer 1: structure |  <-- THIS LESSON
   | (cheap heuristics) |
   +---------+----------+
             | survivors only
   +---------v----------+   milliseconds, model inference
   | Layer 2: ML model  |
   +---------+----------+
             | survivors only
   +---------v----------+   seconds, fetches the live page
   | Layer 3: live fetch|
   +--------------------+

Layer 1 is the part you can run inline on every request without melting the budget. A structural score also gives analysts an auditable reason ("host contains @ and a brand-plus-dash") instead of an opaque model output, which matters for incident response and for tuning false positives. Learning to write it safely in C also teaches the broader skill of parsing untrusted input without buffer overruns — the same discipline you need everywhere in security work.

Core concepts

1. The authority is the only part you score

Definition. The authority (commonly called the host here) is the chunk of a URL between // and the next /, ?, #, or the end of the string. In https://login.paypal.com/verify the authority is login.paypal.com.

Why it matters. The host decides where your browser actually connects. The path (/verify) is just a label the server interprets — anyone can put paypal in a path. So a brand keyword in the path means nothing; the same keyword in the host is a signal.

How it works internally. Find // with strstr, step 2 bytes past it to the host start, then the host ends at the first /, ?, or #, or at the \0. If there is no //, treat the whole string as the host (defensive default).

https://login.paypal.com/verify?id=9
^^^^^   ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^
scheme       host           path+query
      ||
      strstr(url, "//") points here; +2 -> host start
                        ^ first '/' -> host end

When to use / when not. This two-marker approach is fine for a heuristic. Do not use it as a real parser: it ignores ports (:8080), userinfo (user@host), IPv6 literals ([::1]), and percent-encoding. For anything that makes a security decision you trust, use a real URL library.

Pitfall. Forgetting that the host can run to the end of the string (no trailing /), e.g. http://evil.tld. If you only stop at /, you read past the end. Always also stop at \0.

Knowledge check: In http://a.com@b.com/x, which host does the browser connect to, and why is the @ the dangerous part?

2. The userinfo (`@`) spoofing trick

Definition. Inside the authority, userinfo@host lets a URL carry login info. Everything before the last @ is userinfo; the real host is what follows.

Plain language. https://paypal.com@evil.tld/ looks like it goes to paypal.com, but the browser connects to evil.tld. paypal.com is just a username it ignores.

How it works. The browser splits the authority at the last @; the left side is decoration, the right side is the destination. A human reading left-to-right sees the trusted brand first and stops reading.

When to flag. Any @ in the authority is suspicious for a phishing heuristic — legitimate web URLs almost never use userinfo today.

Pitfall. An @ in the path or query (e.g. ?email=a@b.com) is normal. If you search the whole URL for @ you will false-positive on those. For a beginner heuristic we accept that trade-off, but a stronger version checks only within the authority.
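The "stronger version" mentioned in the pitfall could look roughly like the sketch below. The function name at_in_authority is hypothetical and not part of the lesson's solution; it reuses the same two-marker slicing (skip past //, stop at /, ?, #, or the terminator) and then searches only those bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the lesson's solution): returns 1 if an
   '@' appears inside the authority, 0 if it does not, -1 on NULL input. */
int at_in_authority(const char *url) {
    if (url == NULL) return -1;

    /* Same two-marker slicing the lesson uses: skip past "//" if present. */
    const char *auth = url;
    const char *slashes = strstr(url, "//");
    if (slashes) auth = slashes + 2;

    /* The authority ends at the first '/', '?', '#', or the terminator.
       strcspn stops at '\0' by itself, so it cannot over-read. */
    size_t len = strcspn(auth, "/?#");

    /* memchr inspects exactly len bytes, so path and query are never read. */
    return memchr(auth, '@', len) != NULL;
}
```

With this variant, https://paypal.com@evil.tld/login still returns 1, while https://example.com/?email=a@b.com returns 0 because the @ sits in the query. It inherits the limits of the two-marker approach, so it is a heuristic and not a parser.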

Knowledge check (predict the output): For https://trusted.com@evil.tld/login, our beginner rule scores +1 for the @. Is the real destination trusted.com or evil.tld?

3. Counting structural smells

Each rule below adds one point. None is conclusive; the sum is the signal.

| Rule | Why it is suspicious | Weakness |
|---|---|---|
| URL contains `@` | userinfo spoofing (Concept 2) | `@` can appear in query strings |
| Host has > 4 dots | deep subdomain stacking like `secure.login.account.paypal.evil.tld` | some real CDNs have many dots |
| Host length > 40 | long hosts hide the real domain off-screen | a few legit hosts are long |
| Host has a digit run >= 3 | `login123`, throwaway numbered hosts | versioned hosts use digits |
| URL contains `xn--` | punycode can imitate letters (homoglyphs) | many legit IDNs use it |
| Host has a dash and a brand word (`paypal`, `apple`, `bank`, `microsoft`, `google`, `amazon`) | `paypal-secure.tld` style lures | brand words appear legitimately |

How the digit-run check works. Walk the host one char at a time; keep a counter run that increments on a digit and resets to 0 on a non-digit. If run ever reaches 3, the rule fires. This is the loop skill from the prerequisite lesson applied to a bounded range.

h o m e 1 2 3 . c o m
        ^ run=1
          ^ run=2
            ^ run=3  -> rule fires

Pitfall. Counting digits across the whole URL instead of the host means a path like /order/100025 falsely fires. Bound the loop to the host slice.
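A minimal sketch of the bounded digit-run loop described above. The name max_digit_run is an assumption for illustration; the key point is that the caller passes the host slice and its length, so bytes belonging to the path are never examined.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: longest run of consecutive decimal digits within the
   first n bytes of s. The caller passes the host slice, not the whole URL. */
size_t max_digit_run(const char *s, size_t n) {
    size_t run = 0, best = 0;
    for (size_t i = 0; i < n; i++) {
        if (isdigit((unsigned char)s[i])) {
            run++;                       /* extend the current run */
            if (run > best) best = run;  /* remember the longest so far */
        } else {
            run = 0;                     /* any non-digit resets the run */
        }
    }
    return best;
}
```

Calling it on the 11 host bytes of home123.com gives 3, so the rule fires. Calling it on the first 11 bytes of example.com/order/100025 gives 0, because the loop stops before the path.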

Knowledge check (find-the-bug): A learner writes for (int i = 0; url[i]; i++) to count the digit run. Why does scoring the path break the intent of this rule, and how do you fix it?

Syntax notes

Key library calls, all from <string.h> (see the C strings prerequisite):

#include <string.h>
#include <ctype.h>

const char *p  = strstr(url, "//");    // first occurrence of "//", or NULL
const char *at = strchr(host, '@');    // first '@' from host onward (scans to '\0', not just the host slice), or NULL
size_t n = strlen(host);               // bytes before the '\0'
int d = isdigit((unsigned char)c);     // <ctype.h>; cast avoids UB on negative char values

A tiny annotated skeleton of the host-finding logic:

const char *host = url;                 // default: no scheme present
const char *scheme = strstr(url, "//"); // look for '//'
if (scheme) host = scheme + 2;          // step past the two slashes

size_t host_len = 0;                    // length of just the host slice
while (host[host_len] != '\0' &&        // stop at end of string ...
       host[host_len] != '/'  &&        // ... or path ...
       host[host_len] != '?'  &&        // ... or query ...
       host[host_len] != '#') {         // ... or fragment
    host_len++;
}
// now host[0 .. host_len-1] is the authority

Note the cast (unsigned char)c before isdigit: passing a negative char to a <ctype.h> function is undefined behaviour, a classic C trap.
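One way to make the cast hard to forget is to hide it in a tiny wrapper. The sketch below uses a hypothetical name, is_digit_byte, and assumes nothing beyond <ctype.h>.

```c
#include <ctype.h>

/* Hypothetical wrapper: safe to call with any char value, including bytes
   >= 0x80, which are negative on platforms where plain char is signed. */
int is_digit_byte(char c) {
    /* The cast maps the byte into 0..255, the range isdigit is defined for. */
    return isdigit((unsigned char)c) != 0;
}
```

The wrapper also normalizes the result to 0 or 1, since isdigit is only guaranteed to return some nonzero value for a digit.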

Lesson

What we check

Phishing URL detectors look at structural smells before they look at content. Structural means the shape of the URL itself, not the page it points to.

Common smells include:

Too many dots in the hostname.
An IP address used in place of a domain name.
IDN punycode (an encoded form of non-ASCII domain names).
An @ symbol inside the authority (the host portion of the URL).
Dashes inside the second-level domain.
A suspiciously long hostname.

We score these signals. We do not fetch the URL.
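The IP-as-host smell in the list above is not one of the six scored rules in this lesson, but a check for it could be sketched as follows. The name looks_like_ipv4 is hypothetical, and the sketch assumes the caller passes a host slice with no port or userinfo attached; it only recognizes the plain dotted-quad form.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n-byte host slice is a dotted-quad
   IPv4 literal such as 192.168.0.1, else 0. It does not handle ports, IPv6,
   or the hex/octal spellings some browsers also accept. */
int looks_like_ipv4(const char *host, size_t n) {
    int groups = 0;
    size_t i = 0;
    while (i < n) {
        int digits = 0, value = 0;
        while (i < n && isdigit((unsigned char)host[i])) {
            value = value * 10 + (host[i] - '0');
            digits++;
            if (digits > 3) return 0;   /* also keeps value from overflowing */
            i++;
        }
        if (digits == 0 || value > 255) return 0;
        groups++;
        if (i == n) break;              /* consumed the whole slice */
        if (host[i] != '.') return 0;   /* anything but a dot ends the match */
        i++;                            /* skip the dot */
        if (i == n) return 0;           /* trailing dot: not a clean dotted quad */
    }
    return groups == 4;
}
```

Because the lesson's host slice keeps any :port suffix, a host such as 192.168.0.1:8080 would not match this sketch as written; stripping the port first is left to a real URL parser.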

Heuristics (one point each)

The URL contains @ (authority spoofing).
The hostname has more than 4 dots.
The hostname is longer than 40 characters.
The hostname contains a digit run of length 3 or more (for example, login123).
The URL contains xn-- (IDN punycode). This is a neutral signal on its own, but it is often abused.
The hostname contains a dash and a known brand keyword: paypal, apple, bank, microsoft, google, or amazon.

Your job

Implement int phishy_score(const char *url).

Sum the matching signals and return the total.
If url is NULL, return -1.

Common mistakes

Scoring the path. Brand keywords in the path do not count. Only the hostname matters.
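Misreading xn--. Strictly, it is a prefix of a single host label, not an arbitrary substring of the URL. The whole-URL substring search used in this exercise is a coarser check, but it is still useful as a flag here.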
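For comparison, a label-aware punycode check might look like the sketch below. It is stricter than the whole-URL search this exercise asks for, and the name has_punycode_label is an assumption. It treats a label as starting at offset 0 or right after a dot, and compares the prefix case-insensitively.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if any dot-separated label in the n-byte
   host slice starts with "xn--" (compared case-insensitively), else 0. */
int has_punycode_label(const char *host, size_t n) {
    static const char prefix[] = "xn--";
    const size_t m = sizeof(prefix) - 1;

    for (size_t start = 0; start + m <= n; start++) {
        /* A label starts at offset 0 or immediately after a dot. */
        if (start != 0 && host[start - 1] != '.') continue;

        size_t j = 0;
        while (j < m &&
               tolower((unsigned char)host[start + j]) == prefix[j]) {
            j++;
        }
        if (j == m) return 1;
    }
    return 0;
}
```

The loop bound start + m <= n is the same in-bounds pattern the lesson uses for substring search, so the comparison never reads host[n] or beyond.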

What this is NOT

A real URL parser. Proper parsing must handle userinfo, ports, IPv6 literals, and percent-encoding. That is a follow-up exercise.
A blocklist consulter.

Code examples

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Brand keywords that, combined with a '-' in the host, suggest a lure. */
static const char *BRANDS[] = {
    "paypal", "apple", "bank", "microsoft", "google", "amazon"
};
static const size_t BRAND_COUNT = sizeof(BRANDS) / sizeof(BRANDS[0]);

/* Locate the authority (host). Writes its length via *out_len.
   Returns a pointer INTO url (not a copy), or NULL if url is NULL. */
static const char *find_host(const char *url, size_t *out_len) {
    if (url == NULL) return NULL;
    const char *host = url;                 /* default if no scheme */
    const char *scheme = strstr(url, "//");
    if (scheme) host = scheme + 2;          /* skip the two slashes */

    size_t len = 0;
    while (host[len] != '\0' && host[len] != '/' &&
           host[len] != '?'  && host[len] != '#') {
        len++;                              /* host ends at path/query/frag/NUL */
    }
    *out_len = len;
    return host;
}

/* Case-insensitive search for needle within the first n bytes of haystack. */
static int contains_ci(const char *haystack, size_t n, const char *needle) {
    size_t m = strlen(needle);
    if (m == 0 || m > n) return 0;
    for (size_t i = 0; i + m <= n; i++) {   /* never read past haystack[n-1] */
        size_t j = 0;
        while (j < m &&
               tolower((unsigned char)haystack[i + j]) ==
               tolower((unsigned char)needle[j])) {
            j++;
        }
        if (j == m) return 1;
    }
    return 0;
}

/* Rule-based phishing score. Returns -1 on NULL input. */
int phishy_score(const char *url) {
    if (url == NULL) return -1;

    size_t host_len = 0;
    const char *host = find_host(url, &host_len);
    int score = 0;

    /* Rule 1: an '@' anywhere in the URL (userinfo spoofing). */
    if (strchr(url, '@') != NULL) score++;

    /* Rule 2: more than 4 dots in the host. */
    size_t dots = 0, run = 0, max_run = 0;
    for (size_t i = 0; i < host_len; i++) {
        char c = host[i];
        if (c == '.') dots++;
        if (isdigit((unsigned char)c)) {    /* Rule 4 bookkeeping */
            run++;
            if (run > max_run) max_run = run;
        } else {
            run = 0;
        }
    }
    if (dots > 4) score++;

    /* Rule 3: host longer than 40 characters. */
    if (host_len > 40) score++;

    /* Rule 4: a digit run of length 3 or more in the host. */
    if (max_run >= 3) score++;

    /* Rule 5: IDN punycode prefix appears in the URL. */
    if (strstr(url, "xn--") != NULL) score++;

    /* Rule 6: a dash in the host AND a known brand keyword in the host. */
    int has_dash = (memchr(host, '-', host_len) != NULL);
    if (has_dash) {
        for (size_t b = 0; b < BRAND_COUNT; b++) {
            if (contains_ci(host, host_len, BRANDS[b])) { score++; break; }
        }
    }
    return score;
}

int main(void) {
    const char *samples[] = {
        "https://login.paypal.com/account",          /* clean brand host */
        "https://paypal.com@evil.tld/login",         /* userinfo spoof */
        "http://secure-paypal.account.verify.tld/",  /* dash+brand; only 3 dots */
        "https://xn--80ak6aa92e.com/",               /* punycode */
        "http://home123456.example.org/order/9"      /* digit run */
    };
    size_t count = sizeof(samples) / sizeof(samples[0]);
    for (size_t i = 0; i < count; i++) {
        printf("%2d  %s\n", phishy_score(samples[i]), samples[i]);
    }
    printf("%2d  (NULL)\n", phishy_score(NULL));
    return 0;
}

What it does. phishy_score finds the host once, then applies all six rules; the dot count and the digit-run check share a single pass over the host bytes. main prints the score for five test URLs plus the NULL case.

Expected output:

 0  https://login.paypal.com/account
 1  https://paypal.com@evil.tld/login
 1  http://secure-paypal.account.verify.tld/
 1  https://xn--80ak6aa92e.com/
 1  http://home123456.example.org/order/9
-1  (NULL)

The clean PayPal host scores 0 (no dash, only 2 dots, short, no digit run, no @). The userinfo spoof scores 1 (the @). secure-paypal.account.verify.tld scores 1: it has 3 dots, which is not > 4, so the dots rule does not fire and dash+brand is the only rule that does. The punycode host scores 1 for xn-- (its digit runs are at most 2 long), and home123456.example.org scores 1 for its six-digit run.

Edge cases to know: a host with no trailing /, a URL with no scheme at all (paypal-login.tld), an @ that lives only in the query, and an empty string ("" -> host_len 0 -> score 0).
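The edge cases above can be pinned down with a small test helper. The sketch below copies the lesson's find_host so it compiles on its own; host_is is a hypothetical helper that compares the host slice against an expected string without copying it.

```c
#include <stddef.h>
#include <string.h>

/* Copy of the lesson's host finder so this sketch compiles on its own. */
static const char *find_host(const char *url, size_t *out_len) {
    if (url == NULL) return NULL;
    const char *host = url;                 /* default if no scheme */
    const char *scheme = strstr(url, "//");
    if (scheme) host = scheme + 2;          /* skip the two slashes */

    size_t len = 0;
    while (host[len] != '\0' && host[len] != '/' &&
           host[len] != '?'  && host[len] != '#') {
        len++;
    }
    *out_len = len;
    return host;
}

/* Hypothetical test helper: 1 if the host slice of url equals expected. */
int host_is(const char *url, const char *expected) {
    size_t len = 0;
    const char *host = find_host(url, &len);
    if (host == NULL) return 0;             /* NULL url never matches */
    return len == strlen(expected) && memcmp(host, expected, len) == 0;
}
```

Under these assumptions, host_is("http://evil.tld", "evil.tld"), host_is("paypal-login.tld", "paypal-login.tld"), host_is("https://a.tld/?email=a@b.com", "a.tld"), and host_is("", "") all return 1.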

Line by line

We trace phishy_score("http://secure-paypal.account.verify.tld/").

1. url is non-NULL, so we continue.
2. find_host runs strstr(url, "//") -> points at //secure-...; host = scheme + 2 -> secure-paypal.account.verify.tld/.
3. The while loop advances len until it hits the trailing /. The host slice is secure-paypal.account.verify.tld, host_len = 32.
4. Rule 1: strchr(url, '@') is NULL -> no point.
5. The single pass counts dots and tracks digit runs over the 32 host bytes:

| char span | dots so far | max_run |
|---|---|---|
| `secure-paypal` | 0 | 0 |
| `.account` | 1 | 0 |
| `.verify` | 2 | 0 |
| `.tld` | 3 | 0 |

   Final: dots = 3, max_run = 0.
6. Rule 2: dots > 4? 3 > 4 is false -> no point.
7. Rule 3: host_len > 40? 32 > 40 is false -> no point.
8. Rule 4: max_run >= 3? 0 >= 3 is false -> no point.
9. Rule 5: strstr(url, "xn--") is NULL -> no point.
10. Rule 6: memchr(host, '-', 32) finds the dash in secure-paypal -> has_dash = 1. We scan brands; contains_ci matches paypal inside the host -> +1, then break.
11. Final score = 1.

Note that this URL scores 1, not 2: it has exactly 3 dots (the rule needs > 4), so the only rule that fires is dash+brand. Tracing by hand like this is how you catch your own miscounts; the compiler will not flag them for you.

For the NULL call: the very first if (url == NULL) return -1; fires and nothing else runs — no dereference, no crash.

Common mistakes

Mistake 1: Reading past the end of the string

Wrong:

while (host[len] != '/') len++;   // never checks for '\0'

For http://evil.tld (no trailing /) this walks off the end of the buffer — undefined behaviour, possible crash or garbage.

Right: stop on '\0' and on path delimiters:

while (host[len] && host[len] != '/' && host[len] != '?' && host[len] != '#') len++;

Recognize it: AddressSanitizer reports a heap/stack overflow read; scores vary run-to-run on the same input.

Mistake 2: Scoring the path instead of the host

Wrong: counting dots/digits over the whole url. http://example.com/order/100025 would fire the digit-run rule on 100025 even though the host is clean.

Right: bound every loop to host_len and the host pointer. Only Rules 1 and 5 (@, xn--) intentionally scan the whole URL, because those tricks can appear before or around the host.

Mistake 3: Off-by-one in the substring search

Wrong:

for (size_t i = 0; i <= n; i++) { ... haystack[i + j] ... }  // i <= n over-reads

Right: loop while i + m <= n so the comparison never touches haystack[n] or beyond.

Mistake 4: Passing a signed char to isdigit/tolower

Wrong: isdigit(c) where c is char. On platforms where plain char is signed, a byte >= 0x80 becomes a negative value, and passing that to isdigit or tolower is undefined behaviour.

Right: isdigit((unsigned char)c). This matters precisely for the non-ASCII hosts a phishing tool sees.

Mistake 5: Treating the score as a verdict

Wrong: blocking everything with score >= 1. Legitimate sites trip single rules constantly.

Right: use the score to rank and route to deeper layers, with a tunable threshold, and log the rule names that fired.
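One possible shape for "rank, route, and log the rule names" is a bitmask of fired rules. Everything in the sketch below (the RULE_* flags, the short rule names, and the three functions) is hypothetical and only illustrates the idea; it is not part of the exercise's required interface.

```c
#include <stdio.h>

/* Hypothetical rule flags, one bit per heuristic in this lesson. */
enum {
    RULE_AT     = 1 << 0, RULE_DOTS = 1 << 1, RULE_LEN   = 1 << 2,
    RULE_DIGITS = 1 << 3, RULE_PUNY = 1 << 4, RULE_BRAND = 1 << 5
};

static const char *const RULE_NAMES[] = {
    "at", "dots", "len", "digits", "puny", "brand"
};
#define RULE_COUNT (sizeof(RULE_NAMES) / sizeof(RULE_NAMES[0]))

/* One point per fired rule: the score is the number of set bits. */
int score_from_mask(unsigned mask) {
    int score = 0;
    for (size_t i = 0; i < RULE_COUNT; i++) {
        if (mask & (1u << i)) score++;
    }
    return score;
}

/* Writes a comma-separated list of fired rule names into out.
   Returns 0 on success, -1 if out is NULL or too small. */
int describe_rules(unsigned mask, char *out, size_t out_size) {
    size_t used = 0;
    if (out == NULL || out_size == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < RULE_COUNT; i++) {
        if (!(mask & (1u << i))) continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? "," : "", RULE_NAMES[i]);
        if (n < 0 || (size_t)n >= out_size - used) return -1; /* truncated */
        used += (size_t)n;
    }
    return 0;
}

/* Route, do not block: 1 means "send this URL to the next layer". */
int needs_deeper_check(unsigned mask, int threshold) {
    return score_from_mask(mask) >= threshold;
}
```

For a mask of RULE_AT | RULE_BRAND, describe_rules produces the string at,brand and score_from_mask returns 2. The threshold stays a parameter so it can be tuned without touching the rules.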

Debugging tips

Compiler errors

implicit declaration of strstr/isdigit: you forgot #include <string.h> or #include <ctype.h>.
comparison of integer expressions of different signedness: mixing int and size_t. Keep length counters as size_t.

Runtime errors

Segfault on NULL: confirm the if (url == NULL) return -1; guard runs before any dereference.
Build with sanitizers to catch over-reads early:

gcc -std=c11 -Wall -Wextra -fsanitize=address,undefined -g phishy.c -o phishy
./phishy

ASan will point at the exact line if a loop runs past '\0'.

Logic errors

Score too high: you are probably scanning the path. Print the host slice first:

printf("host=[%.*s] len=%zu\n", (int)host_len, host, host_len);

Brand rule never fires: check case — use the case-insensitive contains_ci, not strstr, since hosts may be mixed case.
Punycode rule fires unexpectedly: xn-- is searched in the whole URL by design; confirm that is what you intend.

Questions to ask when it misbehaves

What exactly is the host slice for this input? Print it.
Which single rule changed the score? Add a one-line print per rule while debugging.
Is the input even reaching the function, or is it NULL/empty?

Memory safety

Security & safety

This function consumes untrusted input (a URL controlled by an attacker), so memory discipline is the security control here.

Bounds. Every host loop is bounded by host_len, and find_host only advances while bytes are non-'\0'. Never index host[i] for i >= host_len. The substring search uses i + m <= n so it cannot read haystack[n].
Lifetimes. find_host returns a pointer into the caller's buffer; it copies nothing and allocates nothing, so there is nothing to free, but the returned pointer is only valid while url is. Do not store it past the call.
Initialization. score, dots, run, max_run, and host_len are all initialized before use. Uninitialized counters are a classic source of nondeterministic scores.
Integer/size_t overflow. Lengths use size_t; comparisons like host_len > 40 cannot wrap for realistic inputs, but keep arithmetic in size_t and avoid subtracting unsigned values where the result could wrap around below zero.
ctype UB. Always cast to (unsigned char) before isdigit/tolower.

Authorization & ethics

This is defensive code: it classifies URLs you already have, in your own lab, mailbox, or feed. Do not use it to probe, fetch, or attack third-party systems. Run experiments only against sample strings, your own test data, or intentionally-vulnerable CTF/lab material on localhost or in a container.

Threat model (text diagram)

  ATTACKER                TRUST BOUNDARY            DEFENDER
  crafts URL  --(email)-->  | mail gateway |  --> phishy_score()
  (asset they target:      |   (entry      |       (runs on
   user's credentials)     |    point)     |        your host)
                           +---------------+
  Risks to the scorer itself: malformed/over-long URLs -> over-read.
  Mitigation: bounded loops, NULL guard, no network I/O.

Insecure vs secure handling

WARNING: Intentionally vulnerable training example — use only in a local, isolated, authorized lab. Do not deploy.

char host[32];
strcpy(host, url + 7);  // assumes "http://" and host < 32 bytes: BUFFER OVERFLOW

Why unsafe: the attacker controls url; a long host overflows the fixed buffer and corrupts the stack, and a url shorter than 7 bytes makes url + 7 point past the end of the string.
Secure fix: never copy into a fixed buffer; work in place with a length, as find_host does.
Test the fix: feed a 5000-character host under ASan; the safe version returns a finite score, the unsafe version aborts.
Logging guidance: log the score, the rule names that fired, and a truncated, sanitized URL; never log full credentials, tokens, or query strings that may carry secrets.
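If a NUL-terminated copy of the host is genuinely needed (for example, to hand to a logging call), a length-checked copy is one option. The sketch below uses a hypothetical helper, copy_host, which refuses to copy when the host does not fit instead of truncating silently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for the rare case where a NUL-terminated copy of the
   host is really needed. Returns 0 on success, -1 if it would not fit. */
int copy_host(char *dst, size_t dst_size, const char *host, size_t host_len) {
    if (dst == NULL || host == NULL || dst_size == 0) return -1;
    if (host_len >= dst_size) {      /* need room for host_len bytes + '\0' */
        dst[0] = '\0';               /* leave dst as a valid empty string */
        return -1;
    }
    memcpy(dst, host, host_len);     /* copies exactly host_len bytes */
    dst[host_len] = '\0';
    return 0;
}
```

With a 32-byte destination, an 8-byte host such as evil.tld is copied and the call returns 0, while a 5000-byte host is rejected with -1 and the destination is left as an empty string.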

Real-world uses

Where this runs in the real world. Secure email gateways, web proxies, browser safe-browsing pre-filters, threat-intel pipelines, and SOC triage tools all run a structural layer like this before anything expensive. It is the same idea behind the URL-reputation scoring in mail security products and in open-source tools that pre-screen feeds of suspicious links.

Beginner best-practice habits

Validate input first (NULL, empty) and return a clear sentinel (-1).
Name rules and constants so the code reads like the spec (BRANDS, max_run).
Keep the host-finding logic in one small function so it is testable in isolation.
Comment only the non-obvious lines (why (unsigned char), why +2).

Advanced best-practice habits

Make weights and the brand list configurable, not hard-coded, so analysts can tune false positives without recompiling.
Restrict the @ check to the authority for precision (the digit check is already host-only); keep xn-- global.
Emit structured logs (score + fired-rule list) for later analysis, and feed confirmed labels back to train Layer 2.
Treat the score as a router, never a final verdict; pair it with a real URL parser before any block action.
Add unit tests for each rule and for the edge cases (no scheme, no trailing slash, IPv6 literal, percent-encoding).

Practice tasks

Beginner 1 — `has_at_sign`

Implement int has_at_sign(const char *url) returning 1 if the URL contains @, else 0; return -1 on NULL.

Input/Output: "a@b.tld" -> 1; "a.tld" -> 0; NULL -> -1.
Hints: strchr; guard NULL first.
Concepts: strings, NULL handling.

Beginner 2 — Print the host slice

Write a function that prints just the authority of a URL using the find_host approach.

I/O: "https://x.y.com/p?q" -> prints x.y.com.
Constraints: no copying into a fixed buffer; use %.*s.
Hints: stop at /, ?, #, or '\0'.
Concepts: host vs path, bounded loops.

Intermediate 1 — Digit-run rule, host only

Implement int has_digit_run(const char *url, size_t min_len) that returns 1 if the host has a digit run of at least min_len.

I/O: ("http://home123.tld/9", 3) -> 1; the /9 in the path must not count.
Hints: find the host first, then loop with a run counter.
Concepts: loops, host slicing, the path/host distinction.

Intermediate 2 — Full `phishy_score`

Implement all six rules and match the expected output table in this lesson.

Constraints: single pass over the host where possible; cast for ctype; -1 on NULL.
Hints: reuse find_host and a case-insensitive search.
Concepts: every concept above.

Challenge — Authority-aware `@` and configurable weights

Extend phishy_score so that (a) the @ rule only fires when the @ is inside the authority (not in the query), and (b) each rule has a weight passed in via a small config struct, returning the weighted sum.

Requirements: define struct rule_weights { int at, dots, len, digits, puny, brand; }; and split the authority correctly when an @ is present (the host is whatever follows the last @ within the authority).
Hints: find the authority end first, then the last @ before that end.
Concepts: precise authority parsing, structs, configurable policy.
Do not fetch any URL; operate purely on strings.

Summary

Score the host, not the path. The authority is what your browser actually connects to; brand words in the path are meaningless.
Six cheap rules, one point each: @ (userinfo spoof), > 4 dots, host > 40 chars, digit run >= 3, xn-- punycode, and dash + brand keyword. The sum is the signal; no single rule is a verdict.
Most important syntax: strstr(url, "//") + bounded while to slice the host; loop with (unsigned char) casts for isdigit/tolower; i + m <= n to keep substring search in bounds.
Common mistakes: reading past '\0', scoring the path, off-by-one in the search, signed-char ctype UB, and treating the score as a block decision.
Remember: guard NULL (return -1), never fetch the URL, keep every loop bounded, and route the score to deeper layers and structured logs — defensive, lab-only, no third-party probing.
