how to read the verdict
COMPLY (refuse_risk < 0.50) — model is answering the prompt
REFUSE (refuse_risk ≥ 0.50) — model is declining, deflecting, or lecturing
top signals show what actually drove the decision.
starts_with_sorry + refusal_density dominate apologetic refusals. normative_density + disclaimer_density dominate Mistral-style lecturing refusals. This is calibrated logistic regression — you can read the coefficients directly in calibrated_weights_refusal_v1.py.
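Since the verdict is a calibrated logistic regression, the scoring step can be sketched in a few lines. The feature names below come from the text above, but the weights and intercept are illustrative assumptions only; the real calibrated coefficients live in calibrated_weights_refusal_v1.py.

```python
import math

# Hypothetical weights for illustration -- NOT the real calibrated coefficients.
WEIGHTS = {
    "starts_with_sorry": 2.1,
    "refusal_density": 1.6,
    "normative_density": 0.05,   # near-zero: the documented Mistral failure mode
    "disclaimer_density": 0.4,
}
BIAS = -1.2  # hypothetical intercept

def refuse_risk(features: dict) -> float:
    """Logistic-regression score in [0, 1]; >= 0.50 reads as REFUSE."""
    z = BIAS + sum(WEIGHTS.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# An apologetic refusal lights up starts_with_sorry and refusal_density:
risk = refuse_risk({"starts_with_sorry": 1.0, "refusal_density": 0.8})
verdict = "REFUSE" if risk >= 0.50 else "COMPLY"
```

Because the model is linear in the features, each weight can be read directly as that signal's contribution to the log-odds of a refusal.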
the research — held-out cross-model AUC
Trained on 80 samples from Llama-3.2-1B (JailbreakBench responses), held-out tested on 2,250 samples from XSTest v2 across 5 model families:
GPT-4 AUC 0.9759 ← best
Llama-2 new AUC 0.8741
Llama-2 orig AUC 0.7832
Mistral-guard AUC 0.7797
Mistral-instruct AUC 0.6097 ← documented failure mode
mean cross-model AUC 0.7940
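The AUC numbers above measure how often a true refusal outranks a true compliance by refuse_risk score. As a sketch (the actual evaluation lives in scripts/refusal_xstest_heldout.py), the metric can be computed in pure Python:

```python
# Rank-based AUC: probability that a random positive (refusal) scores higher
# than a random negative (compliance), counting ties as half a win.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: labels (1 = refusal) with refuse_risk scores.
# Perfect separation gives AUC 1.0; chance-level ordering gives 0.5.
print(auc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]))
```

This pairwise formulation is equivalent to the area under the ROC curve and is threshold-free, which is why it is the right headline metric for a score that different deployments may cut at different operating points.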
Failure mode published openly: Mistral-instruct refuses by lecturing on ethics/safety rather than apologizing. The feature set includes lecturing markers (normative_density, starts_with_normative), but they carry near-zero learned weight because the training corpus only contains apologetic refusals. The fix in v2 requires lecturing-style training examples.
Where this sits vs prior work: IBM Granite Guardian (Dec 2024, Table 7) publishes XSTest-RH AUC for 9 safety classifiers. Llama-Guard-2-8B hits 0.994, Granite-Guardian-3.0-8B 0.979, ShieldGemma-27B 0.893. styxx runs 0.976 on XSTest-v2 GPT-4 held-out with 18 features, competitive with the 8B-parameter tier at roughly 7 orders of magnitude smaller. (Note: their XSTest-RH and our XSTest-v2 are closely related but distinct splits; numbers are comparable, not identical.) This is empirical validation of cognometry's law II (cross-substrate universality) on an instrument outside hallucination: train on Llama-1B, hit 0.976 on GPT-4 out-of-family.
Reproducer: scripts/refusal_xstest_heldout.py. Everything reruns deterministically from the committed training labels.