| system | hal-qa | truthfulqa | ragtruth | pubmedqa | hal-dialog | hal-summ | finance | drop | mean | notes |
|---|---|---|---|---|---|---|---|---|---|---|
| styxx v4.0.1 | 0.998 | 0.994 | 0.807 | 0.719 | 0.676 | 0.643 | 0.492 | 0.424 | 0.719 | 9-signal pooled LR + NLI (DeBERTa-v3) |
| SelfCheckGPT | 0.710–0.790 | — | — | — | — | — | — | — | — | consistency-sampling; published on HaluEval-QA only |
| KnowHalu | 0.740 | — | — | — | — | — | — | — | — | knowledge-graph reasoning; single benchmark |
| HaluCheck | 0.820 | — | — | — | — | — | — | — | — | LLM-as-judge; single benchmark |
— marks cells where the original authors have not published cross-validation numbers under our evaluation configuration. we invite the original authors (or any reader with a local reproduction) to fill these cells via the submission process below; we will verify and credit.
how to read this table
each row is a hallucination detector. each column is a public benchmark with labeled ground truth. each cell is held-out test AUC (not accuracy, not F1, not reported-on-dev). every number in styxx's row has a seed, a split, and a reproducer in the benchmark script.
the mean column is the arithmetic mean over all 8 per-dataset AUCs, including the two below-chance results — not the best 6 of 8, not the best single number. overclaiming is the fastest way to discredit a young field; we are reporting the average that actually reflects cross-domain performance.
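the mean in the table is reproducible by hand from the per-dataset cells:

```python
# Arithmetic mean over all 8 per-dataset AUCs from the table above,
# including the two below-chance results (finance, drop).
aucs = {
    "hal-qa": 0.998, "truthfulqa": 0.994, "ragtruth": 0.807,
    "pubmedqa": 0.719, "hal-dialog": 0.676, "hal-summ": 0.643,
    "finance": 0.492, "drop": 0.424,
}
mean_auc = sum(aucs.values()) / len(aucs)
print(round(mean_auc, 3))  # 0.719 — matches the mean column
```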
why two rows are below chance
halubench-drop — AUC 0.424. DROP hallucinations are extractive-span errors: the wrong span pulled from the right passage. the wrong span is entailed by the passage at the NLI level and overlaps heavily with the right tokens at the novelty level, so every signal in the current stack is structurally blind to it. the fix needs span-level faithfulness scoring.
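a toy illustration (passage and spans are invented, not from DROP) of why a token-novelty signal is blind here: the wrong span shares all of its tokens with the passage, so its novelty is identical to the right span's.

```python
# Hypothetical example of an extractive-span error. Both the correct and
# the wrong answer span come verbatim from the passage, so a token-level
# novelty score cannot separate them.
passage = "Revenue rose to 41.2 million in 2019 and fell to 38.7 million in 2020."
right_span = "38.7 million"
wrong_span = "41.2 million"  # wrong answer, but fully present in the passage

def novelty(span: str, source: str) -> float:
    """Fraction of span tokens NOT found in the source text."""
    src = set(source.lower().split())
    toks = span.lower().split()
    return sum(t not in src for t in toks) / len(toks)

print(novelty(right_span, passage), novelty(wrong_span, passage))  # 0.0 0.0
```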
halubench-financebench — AUC 0.492. hallucinations here are calculation or aggregation errors on numbers copied verbatim from the source. the arithmetic is wrong, but the tokens are right — NLI and novelty both pass. the fix needs numeric-symbolic verification.
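a hedged sketch of what numeric-symbolic verification could look like (this is not the styxx implementation, and the helper names are ours): extract numbers from source and response, then flag any response number that is neither copied from the source nor derivable as a sum or difference of two source numbers.

```python
import itertools
import re

def numbers(text: str) -> list[float]:
    # Naive extraction: grabs any digit run, including e.g. the "1" in "Q1".
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def unsupported_numbers(source: str, response: str, tol: float = 1e-6) -> list[float]:
    """Response numbers not copied from, or trivially derived from, the source."""
    src = numbers(source)
    derivable = set(src)
    for a, b in itertools.permutations(src, 2):
        derivable.update({a + b, a - b})
    return [x for x in numbers(response)
            if not any(abs(x - d) <= tol for d in derivable)]

src = "Q1 revenue was 40 million and Q2 revenue was 45 million."
print(unsupported_numbers(src, "Total revenue was 85 million."))  # [] — 40 + 45 = 85
print(unsupported_numbers(src, "Total revenue was 90 million."))  # [90.0]
```

a real fix would need unit handling, percentages, and multi-step aggregation; this only shows the shape of the check.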
both failure modes are declared in `calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes`, so production callers know where the detector will lie. when a competitor beats us on DROP or FinanceBench, we will cite them in the next paper.
full deep-dive on both failure modes — what we tried, why it came up null, and what a real fix looks like: /cognometry/failures →
submit your system
the bar for appearing on this board is held-out AUC on at least one of the 8 benchmarks, computed with the same train/test split (75/25) and the same random seeds ([31, 47, 83]) the styxx reproducer uses. run your detector against the benchmark loaders in the repo, email or DM us a results.json, and we will add your row.
```shell
python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py  # reproduces the styxx row of this table
```
adapting it to your own detector: swap the extract_signals call with your pipeline, keep the loaders identical, and report per-seed AUCs in the same JSON shape.
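as a hedged sketch of the protocol (75/25 split, seeds [31, 47, 83], held-out AUC) — the toy scores/labels stand in for a benchmark loader's output, and the JSON field names are our assumption, not the repo's authoritative schema:

```python
import json

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEEDS = (31, 47, 83)  # the disclosed seeds from the protocol above

def per_seed_aucs(scores, labels):
    """Held-out AUC at each seed on a 75/25 split (score-only detector)."""
    aucs = []
    for seed in SEEDS:
        _, s_test, _, y_test = train_test_split(
            scores, labels, test_size=0.25, random_state=seed, stratify=labels)
        aucs.append(round(roc_auc_score(y_test, s_test), 3))
    return aucs

# Toy stand-in for one benchmark loader's output.
scores = [0.1] * 40 + [0.9] * 40   # detector scores
labels = [0] * 40 + [1] * 40       # 1 = hallucinated

results = {
    "detector": "my-detector-v1",  # hypothetical name
    "benchmarks": {"hal-qa": {"per_seed_auc": per_seed_aucs(scores, labels)}},
}
print(json.dumps(results))
```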
submissions: fork the repo, copy submissions/_template_detector.py, implement your score(question, response, reference), and open a PR with a title starting [submission]. our CI runs your detector against all 8 benchmarks at 3 seeds and posts the AUC table as a PR comment. full protocol at submissions/README.md.
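a minimal sketch of a submission following the score(question, response, reference) interface named above (the class name and baseline logic are ours, not the template's): a naive token-novelty detector that scores higher the more response tokens are absent from the reference.

```python
class NoveltyBaseline:
    """Toy detector: returns a score in [0, 1]; higher = more likely hallucinated."""

    def score(self, question: str, response: str, reference: str) -> float:
        ref_tokens = set(reference.lower().split())
        resp_tokens = response.lower().split()
        if not resp_tokens:
            return 0.0
        novel = sum(t not in ref_tokens for t in resp_tokens)
        return novel / len(resp_tokens)

detector = NoveltyBaseline()
print(detector.score("Who won?", "alice won", "alice won the race"))  # 0.0
print(detector.score("Who won?", "bob won", "alice won the race"))    # 0.5
```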
ground rules
1 · numbers must be held-out, not dev.
2 · random seeds disclosed; 3-seed averaging preferred.
3 · reproducer linked, or a published paper with enough detail to reproduce.
4 · if your score on a benchmark beats ours, we move you above us in that column. we don't grade ourselves.
5 · failure modes are published alongside successes. no partial reporting.
this board is the reference point the field did not have. keep it honest. make it useful.