| system | hal-qa | truthfulqa | ragtruth | pubmedqa | hal-dialog | hal-summ | finance | drop | mean | notes |
|---|---|---|---|---|---|---|---|---|---|---|
| styxx v4.0.1 | 0.998 | 0.994 | 0.807 | 0.719 | 0.676 | 0.643 | 0.492 | 0.424 | 0.719 | 9-signal pooled LR + NLI (DeBERTa-v3) |
| SelfCheckGPT | 0.710–0.790 | — | — | — | — | — | — | — | — | consistency-sampling; published on HaluEval-QA only |
| KnowHalu | 0.740 | — | — | — | — | — | — | — | — | knowledge-graph reasoning; single benchmark |
| HaluCheck | 0.820 | — | — | — | — | — | — | — | — | LLM-as-judge; single benchmark |
— marks cells where the original authors have not published cross-validation numbers under our evaluation configuration. we invite the original authors (or any reader with a local reproduction) to fill these cells via the submission process below; we will verify and credit.
how to read this table
each row is a hallucination detector. each column is a public benchmark with labeled ground truth. each cell is held-out test AUC (not accuracy, not F1, not reported-on-dev). every number in styxx's row has a seed, a split, and a reproducer in the benchmark script.
the mean column is the arithmetic mean over all 8 per-dataset AUCs, including the two below-chance results — not the best 6 of 8, not the best single number. overclaiming is the fastest way to discredit a young field; we are reporting the average that actually reflects cross-domain performance.
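the mean in the table is reproducible by hand from the per-dataset cells:

```python
# Arithmetic mean over all 8 per-dataset AUCs from the table above,
# including the two below-chance results (finance, drop).
aucs = {
    "hal-qa": 0.998, "truthfulqa": 0.994, "ragtruth": 0.807,
    "pubmedqa": 0.719, "hal-dialog": 0.676, "hal-summ": 0.643,
    "finance": 0.492, "drop": 0.424,
}
mean_auc = sum(aucs.values()) / len(aucs)
print(round(mean_auc, 3))  # 0.719 — matches the mean column
```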
why two rows are below chance
halubench-drop — AUC 0.424. DROP hallucinations are extractive-span errors: the wrong span pulled from the right passage. the wrong span is entailed by the passage at the NLI level and overlaps heavily with the right tokens at the novelty level, so every signal in the current stack is structurally blind to it. the fix needs span-level faithfulness scoring.
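a toy illustration (passage and spans are invented, not from DROP) of why a token-novelty signal is blind here: the wrong span shares all of its tokens with the passage, so its novelty is identical to the right span's.

```python
# Hypothetical example of an extractive-span error. Both the correct and
# the wrong answer span come verbatim from the passage, so a token-level
# novelty score cannot separate them.
passage = "Revenue rose to 41.2 million in 2019 and fell to 38.7 million in 2020."
right_span = "38.7 million"
wrong_span = "41.2 million"  # wrong answer, but fully present in the passage

def novelty(span: str, source: str) -> float:
    """Fraction of span tokens NOT found in the source text."""
    src = set(source.lower().split())
    toks = span.lower().split()
    return sum(t not in src for t in toks) / len(toks)

print(novelty(right_span, passage), novelty(wrong_span, passage))  # 0.0 0.0
```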
halubench-financebench — AUC 0.492. hallucinations here are calculation or aggregation errors on numbers copied verbatim from the source. the arithmetic is wrong, but the tokens are right — NLI and novelty both pass. the fix needs numeric-symbolic verification.
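a hedged sketch of what numeric-symbolic verification could look like (this is not the styxx implementation, and the helper names are ours): extract numbers from source and response, then flag any response number that is neither copied from the source nor derivable as a sum or difference of two source numbers.

```python
import itertools
import re

def numbers(text: str) -> list[float]:
    # Naive extraction: grabs any digit run, including e.g. the "1" in "Q1".
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def unsupported_numbers(source: str, response: str, tol: float = 1e-6) -> list[float]:
    """Response numbers not copied from, or trivially derived from, the source."""
    src = numbers(source)
    derivable = set(src)
    for a, b in itertools.permutations(src, 2):
        derivable.update({a + b, a - b})
    return [x for x in numbers(response)
            if not any(abs(x - d) <= tol for d in derivable)]

src = "Q1 revenue was 40 million and Q2 revenue was 45 million."
print(unsupported_numbers(src, "Total revenue was 85 million."))  # [] — 40 + 45 = 85
print(unsupported_numbers(src, "Total revenue was 90 million."))  # [90.0]
```

a real fix would need unit handling, percentages, and multi-step aggregation; this only shows the shape of the check.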
both failure modes are declared in `calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes`, so production callers know where the detector will lie. when a competitor beats us on DROP or FinanceBench, we will cite them in the next paper.
full deep-dive on both failure modes — what we tried, why it came up null, and what a real fix looks like: /cognometry/failures →
submit your system
the bar for appearing on this board is held-out AUC on at least one of the 8 benchmarks, computed with the same train/test split (75/25) and the same random seeds ([31, 47, 83]) the styxx reproducer uses. run your detector against the benchmark loaders in the repo, email or DM us a results.json, and we will add your row.
```shell
python benchmarks/hallucination_test/cross_dataset_8bench_multiseed.py  # reproduces the styxx row of this table
```
adapting it to your own detector: swap the extract_signals call with your pipeline, keep the loaders identical, and report per-seed AUCs in the same JSON shape.
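as a hedged sketch of the protocol (75/25 split, seeds [31, 47, 83], held-out AUC) — the toy scores/labels stand in for a benchmark loader's output, and the JSON field names are our assumption, not the repo's authoritative schema:

```python
import json

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEEDS = (31, 47, 83)  # the disclosed seeds from the protocol above

def per_seed_aucs(scores, labels):
    """Held-out AUC at each seed on a 75/25 split (score-only detector)."""
    aucs = []
    for seed in SEEDS:
        _, s_test, _, y_test = train_test_split(
            scores, labels, test_size=0.25, random_state=seed, stratify=labels)
        aucs.append(round(roc_auc_score(y_test, s_test), 3))
    return aucs

# Toy stand-in for one benchmark loader's output.
scores = [0.1] * 40 + [0.9] * 40   # detector scores
labels = [0] * 40 + [1] * 40       # 1 = hallucinated

results = {
    "detector": "my-detector-v1",  # hypothetical name
    "benchmarks": {"hal-qa": {"per_seed_auc": per_seed_aucs(scores, labels)}},
}
print(json.dumps(results))
```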
submissions: fork the repo, copy submissions/_template_detector.py, implement your score(question, response, reference), and open a PR with a title starting [submission]. our CI runs your detector against all 8 benchmarks at 3 seeds and posts the AUC table as a PR comment. full protocol at submissions/README.md.
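a minimal sketch of a submission following the score(question, response, reference) interface named above (the class name and baseline logic are ours, not the template's): a naive token-novelty detector that scores higher the more response tokens are absent from the reference.

```python
class NoveltyBaseline:
    """Toy detector: returns a score in [0, 1]; higher = more likely hallucinated."""

    def score(self, question: str, response: str, reference: str) -> float:
        ref_tokens = set(reference.lower().split())
        resp_tokens = response.lower().split()
        if not resp_tokens:
            return 0.0
        novel = sum(t not in ref_tokens for t in resp_tokens)
        return novel / len(resp_tokens)

detector = NoveltyBaseline()
print(detector.score("Who won?", "alice won", "alice won the race"))  # 0.0
print(detector.score("Who won?", "bob won", "alice won the race"))    # 0.5
```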
ground rules
1 · numbers must be held-out, not dev.
2 · random seeds disclosed; 3-seed averaging preferred.
3 · reproducer linked, or a published paper with enough detail to reproduce.
4 · if your score on a benchmark beats ours, we move you above us in that column. we don't grade ourselves.
5 · failure modes are published alongside successes. no partial reporting.
this board is the reference point the field did not have. keep it honest. make it useful.