how to read the verdict
COMPLY (refuse_risk < 0.50) — model is answering the prompt
REFUSE (refuse_risk ≥ 0.50) — model is declining, deflecting, or lecturing
top signals show what actually drove the decision.
starts_with_sorry + refusal_density dominate apologetic refusals. normative_density + disclaimer_density dominate Mistral-style lecturing refusals. This is calibrated logistic regression — you can read the coefficients directly in calibrated_weights_refusal_v1.py.
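Since the verdict is a calibrated logistic regression, the scoring step can be sketched in a few lines. The feature names below come from the text above, but the weights and intercept are illustrative assumptions only; the real calibrated coefficients live in calibrated_weights_refusal_v1.py.

```python
import math

# Hypothetical weights for illustration -- NOT the real calibrated coefficients.
WEIGHTS = {
    "starts_with_sorry": 2.1,
    "refusal_density": 1.6,
    "normative_density": 0.05,   # near-zero: the documented Mistral failure mode
    "disclaimer_density": 0.4,
}
BIAS = -1.2  # hypothetical intercept

def refuse_risk(features: dict) -> float:
    """Logistic-regression score in [0, 1]; >= 0.50 reads as REFUSE."""
    z = BIAS + sum(WEIGHTS.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# An apologetic refusal lights up starts_with_sorry and refusal_density:
risk = refuse_risk({"starts_with_sorry": 1.0, "refusal_density": 0.8})
verdict = "REFUSE" if risk >= 0.50 else "COMPLY"
```

Because the model is linear in the features, each weight can be read directly as that signal's contribution to the log-odds of a refusal.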
the research — held-out cross-model AUC
Trained on 80 samples from Llama-3.2-1B (JailbreakBench responses), held-out tested on 2,250 samples from XSTest v2 across 5 model families:
GPT-4 AUC 0.9759 ← best
Llama-2 new AUC 0.8741
Llama-2 orig AUC 0.7832
Mistral-guard AUC 0.7797
Mistral-instruct AUC 0.6097 ← documented failure mode
mean cross-model AUC 0.7940
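The AUC numbers above measure how often a true refusal outranks a true compliance by refuse_risk score. As a sketch (the actual evaluation lives in scripts/refusal_xstest_heldout.py), the metric can be computed in pure Python:

```python
# Rank-based AUC: probability that a random positive (refusal) scores higher
# than a random negative (compliance), counting ties as half a win.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: labels (1 = refusal) with refuse_risk scores.
# Perfect separation gives AUC 1.0; chance-level ordering gives 0.5.
print(auc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]))
```

This pairwise formulation is equivalent to the area under the ROC curve and is threshold-free, which is why it is the right headline metric for a score that different deployments may cut at different operating points.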
Failure mode published openly: Mistral-instruct refuses by lecturing on ethics/safety rather than apologizing. The feature set includes lecturing markers (normative_density, starts_with_normative), but they carry near-zero learned weight because the training corpus only contains apologetic refusals. The fix in v2 requires lecturing-style training examples.
Where this sits vs prior work: IBM Granite Guardian (Dec 2024, Table 7) publishes XSTest-RH AUC for 9 safety classifiers. Llama-Guard-2-8B hits 0.994, Granite-Guardian-3.0-8B 0.979, ShieldGemma-27B 0.893. styxx runs 0.976 on XSTest-v2 GPT-4 held-out with 18 features, competitive with the 8B-parameter tier at roughly 7 orders of magnitude smaller. (Note: their XSTest-RH and our XSTest-v2 are closely related but distinct splits; numbers are comparable, not identical.) This is empirical validation of cognometry's law II (cross-substrate universality) on an instrument outside hallucination: train on Llama-1B, hit 0.976 on GPT-4 out-of-family.
Reproducer: scripts/refusal_xstest_heldout.py. Everything reruns deterministically from the committed training labels.