fathom/leaderboard
sycophancy 0.04 · deception 0.02 · drift 0.11 · overconfidence 0.07 · scored 2026-04-30 by styxx 7.1.0
COGNOMETRY LEADERBOARD · v4.0.0 weights · 3-seed averaged

Hallucination detection across eight benchmarks.

Per-benchmark AUC for the styxx hallucination detector. Same 9-feature pooled logistic regression, same fitted weights, no per-domain tuning. Six benchmarks above AUC 0.64. Two published failure modes.
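To make "same weights, per-benchmark AUC" concrete, here is a minimal sketch of that evaluation in plain Python. It is not the styxx implementation: the function names (`predict`, `auc`, `per_benchmark_auc`), the row layout, and any weights you pass in are assumptions for illustration. The real detector uses 9 features whose definitions are not given here.

```python
import math
from collections import defaultdict


def predict(features, weights, bias):
    """Logistic-regression probability for one example (sigmoid of a linear score)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity; ties share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of examples tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def per_benchmark_auc(rows, weights, bias):
    """rows: iterable of (benchmark_name, features, label), label 1 = hallucinated.

    One weight vector scores every benchmark; only the grouping differs.
    """
    grouped = defaultdict(lambda: ([], []))
    for name, features, label in rows:
        labels, scores = grouped[name]
        labels.append(label)
        scores.append(predict(features, weights, bias))
    return {name: auc(labels, scores) for name, (labels, scores) in grouped.items()}
```

An AUC of 0.5 means the scores rank hallucinated and faithful examples no better than chance, and values below 0.5 mean the ranking is inverted on that benchmark.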

01 · per-benchmark results

v4.0.0 weights · 3-seed averaged · n = 150 / dataset.

benchmark               AUC    note
HaluEval-QA             0.998  near-perfect on QA-style hallucination
TruthfulQA              0.994  same weights, no tuning
HaluBench-RAGTruth      0.807  new domain (RAG faithfulness)
HaluBench-PubMed        0.719  new domain (biomedical QA)
HaluEval-Dialogue       0.676  NLI-augmented
HaluEval-Summarization  0.643  NLI-augmented
HaluBench-DROP          0.424  published failure mode · extractive-span errors
HaluBench-FinanceBench  0.492  published failure mode · arithmetic on verbatim numbers

Five of eight above AUC 0.65. Two near-perfect. Two failure modes published openly. Where the detector fails →
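The tallies in that summary can be checked directly against the table. The short script below copies the eight AUC values from the table and counts them. The 0.99 cutoff for "near-perfect" and the 0.5 cutoff for a failure mode are assumptions chosen for this check, not thresholds stated by the leaderboard.

```python
# AUC values copied from the per-benchmark table above.
AUC = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMed": 0.719,
    "HaluEval-Dialogue": 0.676,
    "HaluEval-Summarization": 0.643,
    "HaluBench-DROP": 0.424,
    "HaluBench-FinanceBench": 0.492,
}


def count_where(predicate):
    """Count benchmarks whose AUC satisfies the predicate."""
    return sum(1 for value in AUC.values() if predicate(value))


above_065 = count_where(lambda v: v > 0.65)
near_perfect = count_where(lambda v: v >= 0.99)  # assumed cutoff for "near-perfect"
below_chance = count_where(lambda v: v < 0.5)    # assumed cutoff for a failure mode

print(above_065, near_perfect, below_chance)  # prints: 5 2 2
```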

02 · reproduce

Every number. Five minutes. CPU you already own.

$ git clone https://github.com/fathom-lab/styxx
$ cd styxx/benchmarks
$ python run_8bench.py --seeds 13,17,31

# expected output:
HaluEval-QA           AUC = 0.998 ± 0.001
TruthfulQA            AUC = 0.994 ± 0.002
... (8 rows, exactly the numbers above)
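# A sketch of how a "mean ± spread" row like the ones above could be built
# from three per-seed AUCs. How run_8bench.py actually aggregates is not shown
# here; the helper name format_row and the use of population standard
# deviation are assumptions.

```python
from statistics import mean, pstdev


def format_row(name, aucs):
    """Format one leaderboard row from a list of per-seed AUC values.

    Assumption: the ± figure is the population standard deviation across seeds.
    """
    return f"{name:<22}AUC = {mean(aucs):.3f} ± {pstdev(aucs):.3f}"


# Synthetic per-seed values for illustration only, not styxx output.
print(format_row("Demo-Benchmark", [0.25, 0.5, 0.75]))
```

# With only three seeds the spread estimate is coarse, so small ± values
# indicate run-to-run stability more than a precise confidence interval.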

Run the leaderboard yourself.

One git clone. Five minutes on a CPU. Every number reruns from random_state=0.
