the two failures
Styxx v4.0.2 is cross-validated across 8 public hallucination benchmarks. Five of
the eight come in above AUC 0.65. Two come in below chance, meaning the detector
is systematically wrong on them. Rather than hide those two, we published them in
calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes
so production callers know where the detector will lie to them.
HaluBench-FinanceBench. AUC 0.492 ± 0.026. Financial document QA where the answer is derived from numbers explicitly stated in the source.
These are not "noisy" results. They are structurally undetectable with the signal stack we ship. The rest of this page walks through exactly why.
mechanism 1 — extractive-span errors (DROP)
DROP hallucinations are usually the wrong span from the right passage. The model is asked "which team scored the longest field goal?" and answers "Marc Bulger" (a quarterback named in the passage) instead of "Rams" (the correct team, also named).
Our 9-signal stack fails on this pattern by construction:
- NLI contradiction — the wrong span yields a grammatical English claim that the source passage never contradicts. An NLI model scores "Marc Bulger scored the longest field goal" as non-contradicted by a passage that mentions Marc Bulger as a player. The detector reads non-contradiction as support.
- Novelty signals (content/entity/n-gram) — the wrong span is made of tokens that DO appear in the passage. Novelty is near zero for both correct and hallucinated answers.
- Knowledge grounding — the wrong span is grounded in the passage, just with the wrong predicate relationship.
Every signal in our production stack scores correct and hallucinated answers alike. The result: a pooled-LR fit over these signals actively MISLEADS on DROP, which is why the AUC lands below chance rather than at chance. The signals are not noise; they carry information with the wrong polarity.
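To make the blindness concrete, here is a toy sketch of a token-coverage grounding signal. This is not the production implementation; the function and passage are illustrative, but they show why coverage alone cannot separate the correct span from the hallucinated one:

```python
def token_coverage(answer: str, passage: str) -> float:
    """Toy grounding signal: fraction of answer tokens found in the passage."""
    answer_tokens = set(answer.lower().split())
    passage_tokens = set(passage.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & passage_tokens) / len(answer_tokens)

passage = ("marc bulger threw for 200 yards while the rams "
           "kicked the longest field goal of the game")

correct = token_coverage("rams", passage)              # correct span
hallucinated = token_coverage("marc bulger", passage)  # wrong span, same passage
```

Both calls return 1.0: the hallucinated span is built entirely from passage tokens, so the signal grounds it just as firmly as the correct answer.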
what we tried (null)
Before shipping we ran six heuristic probes targeting the DROP mechanism specifically. Full code at benchmarks/hallucination_test/probe_drophacks.py. At n=300 per class, seed 31, tie-averaged Mann-Whitney U AUC:
| probe | auc | what it tests |
|---|---|---|
| role_mismatch | 0.500 | wh-word expected type vs. response type (regex-based) |
| answer_context_adjacency | 0.520 | anchor token distance to question keywords in reference |
| answer_rank_in_passage | 0.499 | numeric anchor's rank among ref numbers (middle = risk) |
| qa_nli (concat) | 0.492 | NLI hypothesis = "Q A" concatenated, premise = reference |
| multi_number_density | 0.475 | ambiguity signal — many numbers in ref → higher risk |
| scope_sentence_nli | 0.454 | NLI on ref sentences filtered by keyword match with Q |
Every heuristic sits within noise of chance. Four land below chance, consistent with the pattern above: the underlying signal carries information, but with the wrong polarity under each naive framing. None cleared the 0.55 threshold we set for "worth integrating."
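For reference, the tie-averaged Mann-Whitney AUC used in the probe table reduces to a pairwise win rate. A minimal O(n·m) sketch (the committed script presumably uses a library implementation):

```python
def mann_whitney_auc(pos_scores, neg_scores):
    """Tie-averaged Mann-Whitney U normalized to AUC:
    P(pos > neg) + 0.5 * P(pos == neg), over all cross-class pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties split evenly between the classes
    return wins / (len(pos_scores) * len(neg_scores))
```

A value below 0.5 means the positive class scores lower than the negative class on most pairs: wrong polarity, not absence of signal.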
what a real fix requires
DROP-style failures need one of two things. Neither is a heuristic patch.
(A) Span-level faithfulness scoring. Run a trained
extractive-QA reader (e.g. deepset/roberta-base-squad2 or any
modern extractive head) on the passage + question, compare its extracted
span to the LLM's answer. If they disagree at the semantic level,
emit a hallucination signal. This is essentially using a smaller, more
disciplined reader as an oracle — close to the SelfCheckGPT ethos but
targeted at span correctness rather than consistency.
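A sketch of the comparison half of option (A). The extracted span itself would come from a real reader (e.g. a `transformers` question-answering pipeline over deepset/roberta-base-squad2); the scoring below is standard SQuAD-style token F1, and the 0.5 threshold is an illustrative assumption, not a calibrated value:

```python
import re
from collections import Counter

def _normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def span_f1(reader_span, llm_answer):
    """Token-level F1 between the reader's extracted span and the LLM's answer."""
    ref, hyp = _normalize(reader_span), _normalize(llm_answer)
    if not ref or not hyp:
        return float(ref == hyp)
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def span_disagreement_signal(reader_span, llm_answer, threshold=0.5):
    # Low reader/LLM agreement is evidence of a wrong-span hallucination.
    # threshold=0.5 is a hypothetical choice for illustration.
    return span_f1(reader_span, llm_answer) < threshold
```

On the running example, a reader that extracts "Rams" disagrees completely with the hallucinated "Marc Bulger", so the signal fires; paraphrases like "the Rams" survive normalization and do not.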
(B) Typed entity matching. Real NER + role labeling
on both the question ("which team" → ORG) and the answer
(check that the named entity is of type ORG, not
PERSON). Rejects the minority of DROP hallucinations that
are type-level mismatches. Doesn't fix the majority (right type, wrong
value) but removes one class cleanly.
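A sketch of option (B)'s type check. The labels follow the OntoNotes convention spaCy's NER uses; in practice the answer's label would come from a real NER model, and the cue table plus naive substring matching are hypothetical stand-ins for proper wh-phrase parsing:

```python
# expected entity type implied by the question's wh-phrase
# (labels follow the OntoNotes scheme used by spaCy NER)
WH_TYPE_CUES = [
    ("which team", "ORG"),
    ("which company", "ORG"),
    ("which player", "PERSON"),
    ("who", "PERSON"),
    ("where", "GPE"),
    ("when", "DATE"),
]

def expected_type(question):
    # Substring matching is deliberately naive; a real implementation
    # parses the wh-phrase instead of scanning for cues.
    q = question.lower()
    for cue, label in WH_TYPE_CUES:
        if cue in q:
            return label
    return None  # abstain on questions we cannot type

def type_mismatch_signal(question, answer_label):
    """True when the answer's NER label contradicts the expected type.
    answer_label would come from running NER on the model's answer."""
    want = expected_type(question)
    return want is not None and answer_label != want
```

This fires on "which team" answered with a PERSON entity, and abstains cleanly on questions it cannot type, which keeps the false-positive surface small.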
Both are v4.2 research tracks, not v4.1 patches. The probe code committed alongside this paper makes it easy to disconfirm or extend any of them empirically.
mechanism 2 — arithmetic on verbatim numbers (FinanceBench)
FinanceBench hallucinations are usually wrong calculations on numbers copied verbatim from the source. "Operating cash flow ratio = 0.25" when the correct answer (say) would be 0.30. The numbers that go INTO the calculation are present in the passage; the number that comes OUT is fabricated.
Our signals fail here for symmetric reasons to DROP:
- NLI + novelty are semantically blind to arithmetic. "The ratio is 0.25" and "the ratio is 0.30" are both grammatical English statements about the document. NLI scores neither as contradiction when the underlying numbers appear verbatim in different parts of the document.
- Entity verification against Wikipedia is useless — the document contains the ground truth, not a named entity we can verify externally.
- Knowledge grounding (token-coverage) reads the wrong answer as well-grounded because every token is in the source.
what a real fix requires
The answer here is a number-symbolic verification signal. Detect when the question implies an arithmetic operation (ratio, sum, difference, percentage, comparison) — these are syntactically visible in the question's surface form. When arithmetic is implied, extract the candidate operand numbers from the reference using NER + context, attempt to recompute the answer, and emit a hallucination signal when the model's answer differs from the computed result by more than a tolerance.
This is not a small engineering task. Getting the operand-extraction right requires financial-document-aware NER; getting the formula matching right requires a light arithmetic parser over natural-language question stems; and the whole thing has to be precise enough that it does not false-positive on paraphrases like "about 0.3" versus "0.30." But it is also not a research moonshot. It's v4.2 with one targeted engineer-month.
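The recompute step can be sketched under heavy simplifying assumptions: the cue table, the two-operand restriction, and the regex number extractor below are all illustrative stand-ins for the financial-NER and formula-matching work described above:

```python
import math
import re
from itertools import permutations

# arithmetic operations cued by question surface forms (illustrative, not exhaustive)
ARITH_CUES = {
    "ratio": lambda a, b: a / b,
    "difference": lambda a, b: a - b,
    "sum": lambda a, b: a + b,
}

def extract_numbers(text):
    """Naive numeric extraction; a real version needs financial-document NER."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)]

def arithmetic_mismatch_signal(question, reference, model_answer, rel_tol=0.02):
    """True when the question implies arithmetic and no pair of reference
    numbers recomputes to the model's claimed number within tolerance."""
    op = next((f for cue, f in ARITH_CUES.items() if cue in question.lower()), None)
    claimed = extract_numbers(model_answer)
    if op is None or not claimed:
        return False  # abstain: no arithmetic implied, or no number to check
    for a, b in permutations(extract_numbers(reference), 2):
        try:
            if math.isclose(op(a, b), claimed[0], rel_tol=rel_tol):
                return False  # some operand pair reproduces the answer
        except ZeroDivisionError:
            continue
    return True
```

The `rel_tol` on `math.isclose` absorbs rounding such as "0.30" versus 0.3; the harder paraphrase cases ("about 0.3", "roughly a third") are exactly the precision risk flagged above.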
why we published this
Every published hallucination detector we cite has at least one benchmark it was evaluated on and at least one it wasn't. Most papers don't discuss the benchmarks they skipped. Some report headline numbers on the benchmark they do best at without mentioning the others.
Our 5/8-above-0.65 + 2/8-below-chance result is a real number, averaged
over 3 seeds, with a committed reproducer you can run against
pminervini/HaluEval and PatronusAI/HaluBench
directly from the Hub. The same detector, the same weights, the same code
path evaluated on all eight. The two failures are part of the shape of
the work, not a subset to be optimized out.
We think the honesty is load-bearing for building a durable field. If you have a detector that does better than 0.5 on DROP, we want to cite you in the next paper. If you have a better method than NLI + novelty for the RAG-faithfulness class of errors, we want to add your signal to the stack. The leaderboard and the submission protocol are how we keep the door open.
reproducing these numbers
```shell
pip install "styxx[nli]==4.0.2"

# run the 6-hack null probe yourself
git clone https://github.com/fathom-lab/styxx
cd styxx
python benchmarks/hallucination_test/probe_drophacks.py --n 150 --seed 31

# run the full 8-benchmark calibration (3 seeds × NLI on/off)
python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py
```
The probe script prints each heuristic's AUC as it runs. The multi-seed
calibrator writes results/cross_dataset_8bench_multiseed.json
with per-dataset mean/std and the averaged LR coefficients. Every number
above is derivable from that JSON.
— Full paper: doi.org/10.5281/zenodo.19703527 · Code: github.com/fathom-lab/styxx · Manifesto: /cognometry