the two failures

Styxx v4.0.2 is cross-validated across 8 public hallucination benchmarks. 5 of the 8 come in above AUC 0.65. Two come in below chance — the detector is systematically wrong on them. We published those two rather than hiding them, in calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes, so production callers know exactly where it will lie to them.

HaluBench-DROP. AUC 0.424 ± 0.080 (3-seed averaged). Reading-comprehension questions over Wikipedia paragraphs, where the answer is a specific span of the passage.

HaluBench-FinanceBench. AUC 0.492 ± 0.026. Financial document QA where the answer is derived from numbers explicitly stated in the source.

These are not "noisy" results. They are structurally undetectable with the signal stack we ship. The rest of this page walks through exactly why.

mechanism 1 — extractive-span errors (DROP)

DROP hallucinations are usually the wrong span from the right passage. The model is asked "which team scored the longest field goal?" and answers "Marc Bulger" (a quarterback named in the passage) instead of "Rams" (the correct team, also named).

Our 9-signal stack fails on this pattern by construction.

Every signal in our production stack passes the bar on both correct and hallucinated answers: a wrong span is still copied verbatim from the right passage, so grounding-, novelty-, and consistency-style checks clear it exactly as they clear the correct span. The result: a pooled-LR fit over these signals actively MISLEADS on DROP, which is why the score is below chance rather than at chance. The signals are not noise; they carry information with the wrong polarity.
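A synthetic illustration of that last point (toy scores, not Styxx production signals): under the same tie-averaged Mann-Whitney U AUC used throughout this page, an informative signal with inverted polarity lands well below 0.5, while pure noise would sit at 0.5.

```python
# Synthetic demo: an informative signal with inverted polarity lands BELOW
# chance; flipping its sign recovers the information. Pure noise would sit
# at AUC ~ 0.5. These scores are made up -- not Styxx production signals.

def tie_averaged_auc(pos_scores, neg_scores):
    """Mann-Whitney U AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# Suppose hallucinated answers (the positive class) tend to score LOWER.
hallucinated = [0.1, 0.3, 0.5]
correct = [0.4, 0.6, 0.8]

below = tie_averaged_auc(hallucinated, correct)
flipped = tie_averaged_auc([-s for s in hallucinated], [-s for s in correct])
print(below, flipped)  # ~0.111 vs ~0.889: same information, opposite polarity
```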

what we tried (null)

Before shipping we ran six heuristic probes targeting the DROP mechanism specifically. Full code at benchmarks/hallucination_test/probe_drophacks.py. At n=300 per class, seed 31, tie-averaged Mann-Whitney U AUC:

probe                     AUC    what it tests
role_mismatch             0.500  wh-word expected type vs. response type (regex-based)
answer_context_adjacency  0.520  anchor token distance to question keywords in reference
answer_rank_in_passage    0.499  numeric anchor's rank among ref numbers (middle = risk)
qa_nli (concat)           0.492  NLI hypothesis = "Q A" concatenated, premise = reference
multi_number_density      0.475  ambiguity signal — many numbers in ref → higher risk
scope_sentence_nli        0.454  NLI on ref sentences filtered by keyword match with Q

Every heuristic sits within noise of chance. Four land below 0.5, consistent with the pooled-LR finding that the underlying information is real but carries the wrong polarity under each naive framing. None breached the 0.55 threshold we set for "worth integrating."

what a real fix requires

DROP-style failures need one of two things. Neither is a heuristic patch.

(A) Span-level faithfulness scoring. Run a trained extractive-QA reader (e.g. deepset/roberta-base-squad2 or any modern extractive head) on the passage + question, compare its extracted span to the LLM's answer. If they disagree at the semantic level, emit a hallucination signal. This is essentially using a smaller, more disciplined reader as an oracle — close to the SelfCheckGPT ethos but targeted at span correctness rather than consistency.
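A minimal sketch of the comparison step in (A). The reader call itself is elided — `reader_answer` stands in for the span an extractive model such as deepset/roberta-base-squad2 would return — and `token_f1` is our own illustrative SQuAD-style overlap helper, not a Styxx API.

```python
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between two answer spans."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def span_disagreement_signal(llm_answer: str, reader_answer: str,
                             thresh: float = 0.5) -> bool:
    """Emit a hallucination signal when the disciplined reader's extracted
    span and the LLM's answer share too little token overlap.
    reader_answer would come from e.g. a transformers question-answering
    pipeline over the same passage + question."""
    return token_f1(llm_answer, reader_answer) < thresh

# The DROP example from above: the reader extracts "Rams", the LLM said "Marc Bulger".
print(span_disagreement_signal("Marc Bulger", "Rams"))  # True -> flag
print(span_disagreement_signal("the Rams", "Rams"))     # False -> agree
```

Semantic-level agreement would need more than token overlap (aliases, paraphrase), but this is the shape of the signal.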

(B) Typed entity matching. Real NER + role labeling on both the question ("which team" → ORG) and the answer (check that the named entity is of type ORG, not PERSON). Rejects the minority of DROP hallucinations that are type-level mismatches. Doesn't fix the majority (right type, wrong value) but removes one class cleanly.
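A rule-level sketch of (B). The wh-stem map and the toy gazetteer below are illustrative stand-ins; a production version would use a real NER + role-labeling model (e.g. spaCy) instead of lookup tables.

```python
# Sketch of typed entity matching. WH_TO_TYPE and TOY_NER are hand-written
# stand-ins for real question typing and real NER -- illustrative only.

WH_TO_TYPE = {
    "which team": "ORG",
    "which player": "PERSON",
    "which quarterback": "PERSON",
    "how many": "NUMBER",
}

TOY_NER = {  # stand-in for a real NER model
    "rams": "ORG",
    "marc bulger": "PERSON",
}

def expected_type(question: str) -> "str | None":
    q = question.lower()
    for stem, typ in WH_TO_TYPE.items():
        if stem in q:
            return typ
    return None

def type_mismatch_signal(question: str, answer: str) -> bool:
    """Flag only when the answer's entity type contradicts the type the
    question asks for. Right-type/wrong-value hallucinations pass through,
    which is exactly the limitation noted above."""
    want = expected_type(question)
    got = TOY_NER.get(answer.lower())
    if want is None or got is None:
        return False  # no typed expectation or unknown entity -> abstain
    return got != want

print(type_mismatch_signal("Which team scored the longest field goal?", "Marc Bulger"))  # True
print(type_mismatch_signal("Which team scored the longest field goal?", "Rams"))         # False
```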

Both are v4.2 research tracks, not v4.1 patches. The probe code committed alongside this paper makes it easy to disconfirm or extend any of them empirically.

mechanism 2 — arithmetic on verbatim numbers (FinanceBench)

FinanceBench hallucinations are usually wrong calculations on numbers copied verbatim from the source. "Operating cash flow ratio = 0.25" when the correct answer (say) would be 0.30. The numbers that go INTO the calculation are present in the passage; the number that comes OUT is fabricated.

Our signals fail here for reasons symmetric to DROP: every operand is copied verbatim from the source, and the computed output is novel whether it is right or wrong. A correctly derived ratio is just as absent from the passage as a miscomputed one, so grounding and novelty signals fire identically on both classes and separate nothing.
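A toy check (synthetic example, not a Styxx signal) makes the symmetry concrete: a naive "is the answer's number verbatim in the source?" test fires on both the correct and the hallucinated computed answer.

```python
# Toy demo: a naive number-novelty check cannot separate correct from
# hallucinated COMPUTED answers -- both are absent from the source verbatim.
# Synthetic example, not a Styxx production signal.
import re

def number_is_novel(answer: str, reference: str) -> bool:
    return answer not in re.findall(r"\d[\d,]*\.?\d*", reference)

ref = "Operating cash flow was 150 and current liabilities were 500."
print(number_is_novel("150", ref))   # False -- operands ARE grounded
print(number_is_novel("0.30", ref))  # True  -- the CORRECT ratio is novel
print(number_is_novel("0.25", ref))  # True  -- so is the hallucinated one
```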

what a real fix requires

The answer here is a number-symbolic verification signal. Detect when the question implies an arithmetic operation (ratio, sum, difference, percentage, comparison) — these are syntactically visible in the question's surface form. When arithmetic is implied, extract the candidate operand numbers from the reference using NER + context, attempt to recompute the answer, and emit a hallucination signal when the model's answer differs from the computed result by more than a tolerance.
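A minimal sketch of that signal, under heavy assumptions: operation detection is a keyword lookup, operand extraction is a bare regex, and candidate operand pairs are brute-forced. Production needs financial-document-aware NER and a real question-stem parser; every name here is illustrative.

```python
# Sketch of a number-symbolic verification signal. OP_KEYWORDS and the
# regex-based operand extraction are crude stand-ins for the real thing.
import itertools
import re

OP_KEYWORDS = {
    "ratio": lambda a, b: a / b,
    "sum": lambda a, b: a + b,
    "difference": lambda a, b: a - b,
    "percentage": lambda a, b: 100.0 * a / b,
}

def extract_numbers(text: str) -> "list[float]":
    return [float(m.replace(",", "")) for m in re.findall(r"-?\d[\d,]*\.?\d*", text)]

def arithmetic_hallucination_signal(question: str, reference: str,
                                    model_answer: float, tol: float = 0.02) -> bool:
    """True when the question implies arithmetic but no operand pair from the
    reference recomputes the model's answer within a relative tolerance.
    The tolerance absorbs paraphrases like "about 0.3" vs "0.30"."""
    ops = [f for kw, f in OP_KEYWORDS.items() if kw in question.lower()]
    if not ops:
        return False  # no arithmetic implied -> abstain
    operands = extract_numbers(reference)
    for f in ops:
        for a, b in itertools.permutations(operands, 2):
            try:
                if abs(f(a, b) - model_answer) <= tol * max(abs(model_answer), 1e-9):
                    return False  # some operand pair reproduces the answer
            except ZeroDivisionError:
                continue
    return True  # arithmetic implied, nothing recomputes -> flag

ref = "Operating cash flow was 150 and current liabilities were 500."
q = "What is the operating cash flow ratio?"
print(arithmetic_hallucination_signal(q, ref, 0.30))  # False: 150/500 checks out
print(arithmetic_hallucination_signal(q, ref, 0.25))  # True: nothing recomputes
```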

This is not a small engineering task. Getting the operand-extraction right requires financial-document-aware NER; getting the formula matching right requires a light arithmetic parser over natural-language question stems; and the whole thing has to be precise enough that it does not false-positive on paraphrases like "about 0.3" versus "0.30." But it is also not a research moonshot. It's v4.2 with one targeted engineer-month.

why we published this

Every published hallucination detector we cite has at least one benchmark it was evaluated on and at least one it wasn't. Most don't talk about the ones they weren't evaluated on. Some report headline numbers on the benchmark they're best at without mentioning the others.

Our 5/8-above-0.65 + 2/8-below-chance result is a real number, averaged over 3 seeds, with a committed reproducer you can run against pminervini/HaluEval and PatronusAI/HaluBench directly from the Hub. The same detector, the same weights, the same code path evaluated on all eight. The two failures are part of the shape of the work, not a subset to be optimized out.

We think the honesty is load-bearing for building a durable field. If you have a detector that does better than 0.5 on DROP, we want to cite you in the next paper. If you have a better method than NLI + novelty for the RAG-faithfulness class of errors, we want to add your signal to the stack. The leaderboard and the submission protocol are how we keep the door open.

reproducing these numbers

pip install 'styxx[nli]==4.0.2'

# run the 6-hack null probe yourself
git clone https://github.com/fathom-lab/styxx
cd styxx
python benchmarks/hallucination_test/probe_drophacks.py --n 150 --seed 31

# run the full 8-benchmark calibration (3 seeds × NLI on/off)
python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py

The probe script prints each heuristic's AUC as it runs. The multi-seed calibrator writes results/cross_dataset_8bench_multiseed.json with per-dataset mean/std and the averaged LR coefficients. Every number above is derivable from that JSON.

— Full paper: doi.org/10.5281/zenodo.19703527 · Code: github.com/fathom-lab/styxx · Manifesto: /cognometry