← hallucination playground

refusal detector

cognometric instrument #2 — XSTest-v2 AUC 0.976 on GPT-4, 18 features, trained on Llama-3.2-1B
pyodide loading... styxx 6.0.0 <500 floats vs 8B params MIT licensed
PROMPT 0 chars
AI RESPONSE 0 chars
initializing...
refuse risk
decision
instrument
refuse-v1
top signals driving the decision
how to read the verdict
COMPLY (refuse_risk < 0.50) — model is answering the prompt
REFUSE (refuse_risk ≥ 0.50) — model is declining, deflecting, or lecturing
top signals show what actually drove the decision. starts_with_sorry + refusal_density dominate apologetic refusals. normative_density + disclaimer_density dominate Mistral-style lecturing refusals. This is calibrated logistic regression — you can read the coefficients directly in calibrated_weights_refusal_v1.py.
the research — held-out cross-model AUC
Trained on 80 samples from Llama-3.2-1B (JailbreakBench responses), held-out tested on 2,250 samples from XSTest v2 across 5 model families:

GPT-4                   AUC 0.9759   ← best
Llama-2 new             AUC 0.8741
Llama-2 orig             AUC 0.7832
Mistral-guard           AUC 0.7797
Mistral-instruct       AUC 0.6097   ← documented failure mode
mean cross-model     AUC 0.7940

Failure mode published openly: Mistral-instruct refuses by lecturing on ethics/safety rather than apologizing. The feature set includes lecturing markers (normative_density, starts_with_normative) but they carry near-zero learned weight because the training corpus only contains apologetic refusals. Fix in v2 requires lecturing-style training examples.

Where this sits vs prior work: IBM Granite Guardian (Dec 2024, Table 7) publishes XSTest-RH AUC for 9 safety classifiers. Llama-Guard-2-8B hits 0.994, Granite-Guardian-3.0-8B 0.979, ShieldGemma-27B 0.893. styxx runs 0.976 on XSTest-v2 GPT-4 held-out with 18 features — competitive with the 8B-parameter tier at ~7 orders of magnitude smaller. (Note: their XSTest-RH and our XSTest-v2 are closely related but distinct splits; numbers are comparable, not identical.) This is empirical validation of cognometry's law II (cross-substrate universality) on an instrument outside hallucination: train on Llama-1B, hit 0.976 on GPT-4 out-of-family.

Reproducer: scripts/refusal_xstest_heldout.py. Everything reruns deterministically from the committed training labels.
powered by cognometry + styxx · github

embed this verdict in your site

paste this snippet anywhere — it renders a live detector widget. no install, no api key, works in any static html.
iframe snippet
live preview