Hallucination detection across eight benchmarks.
Per-benchmark AUC for the styxx hallucination detector. Same 9-feature pooled logistic regression, same fitted weights, no per-domain tuning. Six wins. Two published failure modes.
v4.0.0 weights · 3-seed averaged · n = 150 / dataset.
| benchmark | AUC | note |
|---|---|---|
| HaluEval-QA | 0.998 | near-perfect on QA-style hallucination |
| TruthfulQA | 0.994 | same weights, no tuning |
| HaluBench-RAGTruth | 0.807 | new domain (RAG faithfulness) |
| HaluBench-PubMed | 0.719 | new domain (biomedical QA) |
| HaluEval-Dialogue | 0.676 | NLI-augmented |
| HaluEval-Summarization | 0.643 | NLI-augmented |
| HaluBench-FinanceBench | 0.492 | published failure mode · arithmetic on verbatim numbers |
| HaluBench-DROP | 0.424 | published failure mode · extractive-span errors |
Five of eight score above AUC 0.65. Two are near-perfect. Two failure modes, both below chance (AUC 0.5), are published openly.
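For readers who want to sanity-check what a per-benchmark AUC and its 3-seed average mean, here is a minimal sketch. It is not the styxx implementation: the function names, the toy labels and scores, and the choice of sample standard deviation for the spread are all assumptions made for illustration. AUC is computed in its pairwise form, the fraction of (hallucinated, faithful) pairs in which the hallucinated example gets the higher score, so 0.5 is chance and 1.0 is perfect separation.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. labels: 1 = hallucinated, 0 = faithful.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(per_seed_aucs):
    """Mean and sample standard deviation across seeds (assumed spread measure)."""
    return mean(per_seed_aucs), stdev(per_seed_aucs)


# Toy data only: six examples scored under three hypothetical seeds.
labels = [1, 1, 1, 0, 0, 0]
runs = {
    13: [0.9, 0.8, 0.4, 0.5, 0.2, 0.1],
    17: [0.9, 0.7, 0.6, 0.5, 0.3, 0.1],
    31: [0.8, 0.6, 0.3, 0.5, 0.4, 0.2],
}
aucs = [roc_auc(labels, scores) for scores in runs.values()]
m, sd = summarize(aucs)
print(f"toy AUC = {m:.3f} ± {sd:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is fine at n = 150 per dataset; a rank-based version would be the usual choice for larger sets.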
Every number. Five minutes. CPU you already own.
$ git clone https://github.com/fathom-lab/styxx
$ cd styxx/benchmarks
$ python run_8bench.py --seeds 13,17,31
# expected output:
HaluEval-QA AUC = 0.998 ± 0.001
TruthfulQA AUC = 0.994 ± 0.002
... (8 rows, exactly the numbers above)
Run the leaderboard yourself.
One git clone. Five minutes on a CPU. Every number reruns from random_state=0.
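To check a rerun against the published table mechanically, something like the sketch below could be used. It assumes every row is printed in the same shape as the two sample lines shown above (`<benchmark> AUC = <value> ± <spread>`); the helper names `parse_aucs` and `mismatches` and the tolerance are hypothetical and not part of the styxx repository.

```python
import re

# Published AUCs, copied from the table above.
PUBLISHED = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMed": 0.719,
    "HaluEval-Dialogue": 0.676,
    "HaluEval-Summarization": 0.643,
    "HaluBench-DROP": 0.424,
    "HaluBench-FinanceBench": 0.492,
}

# Assumed row format: "HaluEval-QA AUC = 0.998 ± 0.001" (the spread is optional).
ROW = re.compile(r"^(?P<name>\S+)\s+AUC\s*=\s*(?P<auc>\d\.\d+)(?:\s*±\s*\d\.\d+)?")


def parse_aucs(text):
    """Extract {benchmark: AUC} from lines that match the assumed row format."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            found[match.group("name")] = float(match.group("auc"))
    return found


def mismatches(observed, published=PUBLISHED, tol=0.0005):
    """Return {benchmark: (published, observed)} for missing or differing rows."""
    bad = {}
    for name, want in published.items():
        got = observed.get(name)
        if got is None or abs(got - want) > tol:
            bad[name] = (want, got)
    return bad


# Only the two rows quoted above are used here; the rest would come from a real run.
sample = "HaluEval-QA AUC = 0.998 ± 0.001\nTruthfulQA AUC = 0.994 ± 0.002"
observed = parse_aucs(sample)
print(mismatches(observed))
```

With only the two quoted rows as input, the remaining six benchmarks are reported as missing; feeding in the full output of a real run should return an empty dictionary if every number reproduces.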