the two failures
Styxx v4.0.2 is cross-validated across 8 public hallucination benchmarks. Five of
the eight come in above AUC 0.65. Two come in below chance, meaning the detector
is systematically wrong on them. Rather than hide those two, we published them in
calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes
so production callers know where the detector will lie to them.
HaluBench-FinanceBench. AUC 0.492 ± 0.026. Financial document QA where the answer is derived from numbers explicitly stated in the source.
These are not "noisy" results. They are structurally undetectable with the signal stack we ship. The rest of this page walks through exactly why.
mechanism 1 — extractive-span errors (DROP)
DROP hallucinations are usually the wrong span from the right passage. The model is asked "which team scored the longest field goal?" and answers "Marc Bulger" (a quarterback named in the passage) instead of "Rams" (the correct team, also named).
Our 9-signal stack fails on this pattern by construction:
- NLI contradiction — the wrong span yields a grammatical English claim that the source passage never contradicts. An NLI model scores "Marc Bulger scored the longest field goal" as non-contradicted by a passage that mentions Marc Bulger as a player. The detector reads non-contradiction as support.
- Novelty signals (content/entity/n-gram) — the wrong span is made of tokens that DO appear in the passage. Novelty is near zero for both correct and hallucinated answers.
- Knowledge grounding — the wrong span is grounded in the passage, just with the wrong predicate relationship.
Every signal in our production stack scores correct and hallucinated answers alike. The result: a pooled-LR fit over these signals actively MISLEADS on DROP, which is why the AUC lands below chance rather than at chance. The signals are not noise; they carry information with the wrong polarity.
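To make the blindness concrete, here is a toy sketch of a token-coverage grounding signal. This is not the production implementation; the function and passage are illustrative, but they show why coverage alone cannot separate the correct span from the hallucinated one:

```python
def token_coverage(answer: str, passage: str) -> float:
    """Toy grounding signal: fraction of answer tokens found in the passage."""
    answer_tokens = set(answer.lower().split())
    passage_tokens = set(passage.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & passage_tokens) / len(answer_tokens)

passage = ("marc bulger threw for 200 yards while the rams "
           "kicked the longest field goal of the game")

correct = token_coverage("rams", passage)              # correct span
hallucinated = token_coverage("marc bulger", passage)  # wrong span, same passage
```

Both calls return 1.0: the hallucinated span is built entirely from passage tokens, so the signal grounds it just as firmly as the correct answer.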
what we tried (null)
Before shipping we ran six heuristic probes targeting the DROP mechanism specifically. Full code at benchmarks/hallucination_test/probe_drophacks.py. At n=300 per class, seed 31, tie-averaged Mann-Whitney U AUC:
| probe | auc | what it tests |
|---|---|---|
| role_mismatch | 0.500 | wh-word expected type vs. response type (regex-based) |
| answer_context_adjacency | 0.520 | anchor token distance to question keywords in reference |
| answer_rank_in_passage | 0.499 | numeric anchor's rank among ref numbers (middle = risk) |
| qa_nli (concat) | 0.492 | NLI hypothesis = "Q A" concatenated, premise = reference |
| multi_number_density | 0.475 | ambiguity signal — many numbers in ref → higher risk |
| scope_sentence_nli | 0.454 | NLI on ref sentences filtered by keyword match with Q |
Every heuristic sits within noise of chance. Four land below chance, consistent with the pattern above: the underlying signal carries information, but with the wrong polarity under each naive framing. None cleared the 0.55 threshold we set for "worth integrating."
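For reference, the tie-averaged Mann-Whitney AUC used in the probe table reduces to a pairwise win rate. A minimal O(n·m) sketch (the committed script presumably uses a library implementation):

```python
def mann_whitney_auc(pos_scores, neg_scores):
    """Tie-averaged Mann-Whitney U normalized to AUC:
    P(pos > neg) + 0.5 * P(pos == neg), over all cross-class pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties split evenly between the classes
    return wins / (len(pos_scores) * len(neg_scores))
```

A value below 0.5 means the positive class scores lower than the negative class on most pairs: wrong polarity, not absence of signal.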
what a real fix requires
DROP-style failures need one of two things. Neither is a heuristic patch.
(A) Span-level faithfulness scoring. Run a trained
extractive-QA reader (e.g. deepset/roberta-base-squad2 or any
modern extractive head) on the passage + question, compare its extracted
span to the LLM's answer. If they disagree at the semantic level,
emit a hallucination signal. This is essentially using a smaller, more
disciplined reader as an oracle — close to the SelfCheckGPT ethos but
targeted at span correctness rather than consistency.
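A sketch of the comparison half of option (A). The extracted span itself would come from a real reader (e.g. a `transformers` question-answering pipeline over deepset/roberta-base-squad2); the scoring below is standard SQuAD-style token F1, and the 0.5 threshold is an illustrative assumption, not a calibrated value:

```python
import re
from collections import Counter

def _normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def span_f1(reader_span, llm_answer):
    """Token-level F1 between the reader's extracted span and the LLM's answer."""
    ref, hyp = _normalize(reader_span), _normalize(llm_answer)
    if not ref or not hyp:
        return float(ref == hyp)
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def span_disagreement_signal(reader_span, llm_answer, threshold=0.5):
    # Low reader/LLM agreement is evidence of a wrong-span hallucination.
    # threshold=0.5 is a hypothetical choice for illustration.
    return span_f1(reader_span, llm_answer) < threshold
```

On the running example, a reader that extracts "Rams" disagrees completely with the hallucinated "Marc Bulger", so the signal fires; paraphrases like "the Rams" survive normalization and do not.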
(B) Typed entity matching. Real NER + role labeling
on both the question ("which team" → ORG) and the answer
(check that the named entity is of type ORG, not
PERSON). Rejects the minority of DROP hallucinations that
are type-level mismatches. Doesn't fix the majority (right type, wrong
value) but removes one class cleanly.
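A sketch of option (B)'s type check. The labels follow the OntoNotes convention spaCy's NER uses; in practice the answer's label would come from a real NER model, and the cue table plus naive substring matching are hypothetical stand-ins for proper wh-phrase parsing:

```python
# expected entity type implied by the question's wh-phrase
# (labels follow the OntoNotes scheme used by spaCy NER)
WH_TYPE_CUES = [
    ("which team", "ORG"),
    ("which company", "ORG"),
    ("which player", "PERSON"),
    ("who", "PERSON"),
    ("where", "GPE"),
    ("when", "DATE"),
]

def expected_type(question):
    # Substring matching is deliberately naive; a real implementation
    # parses the wh-phrase instead of scanning for cues.
    q = question.lower()
    for cue, label in WH_TYPE_CUES:
        if cue in q:
            return label
    return None  # abstain on questions we cannot type

def type_mismatch_signal(question, answer_label):
    """True when the answer's NER label contradicts the expected type.
    answer_label would come from running NER on the model's answer."""
    want = expected_type(question)
    return want is not None and answer_label != want
```

This fires on "which team" answered with a PERSON entity, and abstains cleanly on questions it cannot type, which keeps the false-positive surface small.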
Both are v4.2 research tracks, not v4.1 patches. The probe code committed alongside this paper makes it easy to disconfirm or extend any of them empirically.
mechanism 2 — arithmetic on verbatim numbers (FinanceBench)
FinanceBench hallucinations are usually wrong calculations on numbers copied verbatim from the source. "Operating cash flow ratio = 0.25" when the correct answer (say) would be 0.30. The numbers that go INTO the calculation are present in the passage; the number that comes OUT is fabricated.
Our signals fail here for symmetric reasons to DROP:
- NLI + novelty are semantically blind to arithmetic. "The ratio is 0.25" and "the ratio is 0.30" are both grammatical English statements about the document. NLI scores neither as contradiction when the underlying numbers appear verbatim in different parts of the document.
- Entity verification against Wikipedia is useless — the document contains the ground truth, not a named entity we can verify externally.
- Knowledge grounding (token-coverage) reads the wrong answer as well-grounded because every token is in the source.
what a real fix requires
The answer here is a number-symbolic verification signal. Detect when the question implies an arithmetic operation (ratio, sum, difference, percentage, comparison) — these are syntactically visible in the question's surface form. When arithmetic is implied, extract the candidate operand numbers from the reference using NER + context, attempt to recompute the answer, and emit a hallucination signal when the model's answer differs from the computed result by more than a tolerance.
This is not a small engineering task. Getting the operand-extraction right requires financial-document-aware NER; getting the formula matching right requires a light arithmetic parser over natural-language question stems; and the whole thing has to be precise enough that it does not false-positive on paraphrases like "about 0.3" versus "0.30." But it is also not a research moonshot. It's v4.2 with one targeted engineer-month.
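The recompute step can be sketched under heavy simplifying assumptions: the cue table, the two-operand restriction, and the regex number extractor below are all illustrative stand-ins for the financial-NER and formula-matching work described above:

```python
import math
import re
from itertools import permutations

# arithmetic operations cued by question surface forms (illustrative, not exhaustive)
ARITH_CUES = {
    "ratio": lambda a, b: a / b,
    "difference": lambda a, b: a - b,
    "sum": lambda a, b: a + b,
}

def extract_numbers(text):
    """Naive numeric extraction; a real version needs financial-document NER."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)]

def arithmetic_mismatch_signal(question, reference, model_answer, rel_tol=0.02):
    """True when the question implies arithmetic and no pair of reference
    numbers recomputes to the model's claimed number within tolerance."""
    op = next((f for cue, f in ARITH_CUES.items() if cue in question.lower()), None)
    claimed = extract_numbers(model_answer)
    if op is None or not claimed:
        return False  # abstain: no arithmetic implied, or no number to check
    for a, b in permutations(extract_numbers(reference), 2):
        try:
            if math.isclose(op(a, b), claimed[0], rel_tol=rel_tol):
                return False  # some operand pair reproduces the answer
        except ZeroDivisionError:
            continue
    return True
```

The `rel_tol` on `math.isclose` absorbs rounding such as "0.30" versus 0.3; the harder paraphrase cases ("about 0.3", "roughly a third") are exactly the precision risk flagged above.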
why we published this
Every published hallucination detector we cite has at least one benchmark it was evaluated on and at least one it wasn't. Most papers don't discuss the benchmarks they skipped. Some report headline numbers on the benchmark they do best at without mentioning the others.
Our 5/8-above-0.65 + 2/8-below-chance result is a real number, averaged
over 3 seeds, with a committed reproducer you can run against
pminervini/HaluEval and PatronusAI/HaluBench
directly from the Hub. The same detector, the same weights, the same code
path evaluated on all eight. The two failures are part of the shape of
the work, not a subset to be optimized out.
We think the honesty is load-bearing for building a durable field. If you have a detector that does better than 0.5 on DROP, we want to cite you in the next paper. If you have a better method than NLI + novelty for the RAG-faithfulness class of errors, we want to add your signal to the stack. The leaderboard and the submission protocol are how we keep the door open.
reproducing these numbers
```shell
pip install "styxx[nli]==4.0.2"

# run the 6-hack null probe yourself
git clone https://github.com/fathom-lab/styxx
cd styxx
python benchmarks/hallucination_test/probe_drophacks.py --n 150 --seed 31

# run the full 8-benchmark calibration (3 seeds × NLI on/off)
python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py
```
The probe script prints each heuristic's AUC as it runs. The multi-seed
calibrator writes results/cross_dataset_8bench_multiseed.json
with per-dataset mean/std and the averaged LR coefficients. Every number
above is derivable from that JSON.
— Full paper: doi.org/10.5281/zenodo.19703527 · Code: github.com/fathom-lab/styxx · Manifesto: /cognometry