8 benchmarks · 4 systems listed · 3-seed mean reported · 2 / 8 declared failure modes · MIT+CCBY reference reproducer
system          hal-qa       truthfulqa  ragtruth  pubmedqa  hal-dialog  hal-summ  finance  drop   mean   notes
styxx v4.0.1    0.998        0.994       0.807     0.719     0.676       0.643     0.492    0.424  0.719  9-signal pooled LR + NLI (DeBERTa-v3)
SelfCheckGPT    0.710–0.790  —           —         —         —           —         —        —      —      consistency-sampling; published on HaluEval-QA only
KnowHalu        0.740        —           —         —         —           —         —        —      —      knowledge-graph reasoning; single benchmark
HaluCheck       0.820        —           —         —         —           —         —        —      —      LLM-as-judge; single benchmark

— marks fields where the original authors have not published cross-validation numbers at our evaluation configuration. we invite those authors (or any reader with a local reproduction) to fill these cells via the submission form below; we will verify and credit the result.

how to read this table

each row is a hallucination detector. each column is a public benchmark with labeled ground truth. each cell is held-out test AUC (not accuracy, not F1, not reported-on-dev). every number in styxx's row has a seed, a split, and a reproducer in the benchmark script.
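to make "held-out test AUC" concrete: AUC is the probability that a randomly chosen hallucinated example scores higher than a randomly chosen faithful one, computed on the test split only. a minimal pure-python illustration (the labels and scores below are made up for the example, not drawn from any benchmark):

```python
# toy illustration of a single table cell: held-out AUC, not accuracy.
# AUC = P(score of a random hallucinated example > score of a random
# faithful one), with ties counted as half a win.
def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = hallucinated, held-out split
scores = [0.91, 0.20, 0.30, 0.55, 0.40, 0.10, 0.88, 0.35]

print(auc(y_true, scores))  # → 0.875
```

note that no threshold appears anywhere: unlike accuracy or F1, AUC cannot be inflated by tuning a cutoff on the test set.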

the mean column is arithmetic over the 8 per-dataset AUCs, including the two below-chance results — not the best 6 of 8, not the best single number. overclaiming is the fastest way to discredit a young field; we are reporting the average that actually reflects cross-domain performance.
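the mean cell is exactly this arithmetic over the styxx row, nothing dropped:

```python
# the mean column is the plain arithmetic mean of all 8 per-dataset AUCs,
# including the two below-chance results (finance, drop).
aucs = {
    "hal-qa": 0.998, "truthfulqa": 0.994, "ragtruth": 0.807,
    "pubmedqa": 0.719, "hal-dialog": 0.676, "hal-summ": 0.643,
    "finance": 0.492, "drop": 0.424,
}
mean = sum(aucs.values()) / len(aucs)
print(round(mean, 3))  # → 0.719
```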

why two rows are below chance

halubench-drop — AUC 0.424. drop hallucinations are extractive-span errors: the wrong span pulled from the right passage. the wrong span is entailed by the passage at the NLI level and overlaps heavily with the right tokens at the novelty level, so every signal in the current stack is structurally blind to it. the fix needs span-level faithfulness scoring.

halubench-financebench — AUC 0.492. hallucinations here are calculation or aggregation errors on numbers copied verbatim from the source. the arithmetic is wrong, but the tokens are right — NLI and novelty both pass. fix needs number-symbolic verification.
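a minimal sketch of what number-symbolic verification means here — the function and strings below are illustrative, not part of the styxx pipeline. token-level checks pass because every source figure is copied verbatim; the only way to catch the error is to recompute the derived number:

```python
import re

# illustrative sketch of number-symbolic verification (not the styxx API):
# the copied figures are verbatim, so NLI and novelty both pass; the error
# is in the derived number, so the check has to redo the arithmetic.
def extract_numbers(text):
    return [float(x.replace(",", "")) for x in re.findall(r"\d[\d,]*\.?\d*", text)]

source   = "revenue was 4.2 billion in the first quarter and 5.1 billion in the second"
response = "total first-half revenue was 4.2 + 5.1 = 8.9 billion"  # arithmetic error

src_nums = extract_numbers(source)      # [4.2, 5.1]
claimed_total = extract_numbers(response)[-1]
print(claimed_total == sum(src_nums))   # → False (4.2 + 5.1 = 9.3, not 8.9)
```

a production version would also have to decide which aggregation the response is claiming (sum, difference, ratio), which is the hard part this sketch skips.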

both failure modes are declared in calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes so production callers know where the detector will lie. when a competitor beats us on DROP or FinanceBench, we will cite them in the next paper.
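the declaration can be as simple as a record the caller inspects before trusting a score. this is an illustrative shape only, not the actual contents of calibrated_weights_v4.CALIBRATION_NOTES:

```python
# illustrative shape of a declared-failure-mode record (not the actual
# contents of calibrated_weights_v4.CALIBRATION_NOTES); the point is that
# production callers can check the domain before trusting the score.
DOCUMENTED_FAILURE_MODES = {
    "halubench-drop": {
        "auc": 0.424,
        "cause": "extractive-span errors entailed by the source passage",
    },
    "halubench-financebench": {
        "auc": 0.492,
        "cause": "arithmetic errors on numbers copied verbatim",
    },
}

def score_is_trustworthy(benchmark_domain):
    return benchmark_domain not in DOCUMENTED_FAILURE_MODES

print(score_is_trustworthy("halubench-drop"))  # → False
```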

full deep-dive on both failure modes — what we tried, why it's null, and what a real fix looks like: /cognometry/failures

submit your system

the bar for appearing on this board is held-out AUC on at least one of the 8 benchmarks, computed with the same train/test split (75/25) and the same random seeds ([31, 47, 83]) the styxx reproducer uses. run your detector against the benchmark loaders in the repo, email or DM us a results.json, and we will add your row.

python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py
# reproduces the styxx row of this table

adapting it to your own detector: swap the extract_signals call with your pipeline, keep the loaders identical, report per-seed AUCs in the same JSON shape.
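the protocol itself is small enough to sketch. the loop below uses synthetic labels and scores in place of a real detector and the repo's loaders, and a pure-python AUC instead of a library call — only the seeds and the 75/25 split come from the protocol:

```python
import random

# sketch of the 3-seed / 75-25 evaluation protocol. synthetic scores stand
# in for a real detector; the actual benchmark loaders live in the repo.
SEEDS = [31, 47, 83]

def auc(pairs):  # pairs of (label, score); ties count half
    pos = [s for l, s in pairs if l == 1]
    neg = [s for l, s in pairs if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# stand-in dataset: binary label plus a noisy score correlated with it
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
data = [(l, l * 0.5 + rng.random() * 0.7) for l in labels]

per_seed = []
for seed in SEEDS:
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)        # seed fixes the split
    test = shuffled[int(0.75 * len(shuffled)):]  # held-out 25%
    per_seed.append(auc(test))

print(round(sum(per_seed) / len(per_seed), 3))   # 3-seed mean AUC
```

swapping in a real system means replacing the synthetic scores with your detector's output on each loaded example and keeping everything below that line unchanged.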

submissions: fork the repo, copy submissions/_template_detector.py, implement your score(question, response, reference), PR with title starting [submission]. our CI runs your detector against all 8 benchmarks at 3 seeds and posts the AUC table as a PR comment. full protocol at submissions/README.md.
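a minimal sketch of what a submission might look like. only the score(question, response, reference) signature comes from the protocol above; the class name and the token-overlap heuristic are placeholders for your actual detector:

```python
# illustrative submission skeleton: only the score(question, response,
# reference) signature is specified by the protocol — the class name and
# the overlap heuristic are placeholders for a real detector.
class MyDetector:
    def score(self, question, response, reference):
        """return a float; higher = more likely hallucinated."""
        resp = {w.strip(".,?") for w in response.lower().split()}
        ref = {w.strip(".,?") for w in reference.lower().split()}
        if not resp:
            return 1.0
        # toy heuristic: fraction of response tokens unsupported by the reference
        return len(resp - ref) / len(resp)

detector = MyDetector()
print(detector.score(
    "who wrote Hamlet?",
    "Christopher Marlowe wrote Hamlet.",
    "Hamlet is a tragedy written by William Shakespeare.",
))  # → 0.75
```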

ground rules

1 · numbers must be held-out, not dev.
2 · random seeds disclosed, 3-seed averaged preferred.
3 · reproducer linked, or a published paper with enough detail to reproduce.
4 · if your system beats our score on a benchmark, we move you above us in that column. we don't grade ourselves.
5 · failure modes are published alongside successes. no partial reporting.

this board is the reference point the field did not have. keep it honest. make it useful.