fathom/styxx
sycophancy 0.04 · deception 0.02 · drift 0.11 · overconfidence 0.07 scored 2026-04-30 by styxx 7.1.0
v7.1.0 · OUT NOW · MIT

Cognitive vital signs
for language models.

Styxx ships nine calibrated cognometric instruments and the first reward signal grounded in cognitive failure modes instead of human approval. Drop-in for trl PPO/GRPO/DPO. Pure Python. CPU-only. No API calls. No human raters.

01 · v7.1.0 release

The first reward signal calibrated against cognitive failure modes, not human approval.

RLHF teaches models to please humans. Humans reward agreement and length. So RL-trained models become sycophantic by construction — that is the fixed point of approval-style training. Cognometric reward changes the reference frame: penalty grounded in nine cognometric instruments (six of nine map onto RDoC Cognitive Systems with circuit-level neural-correlate evidence in the lesion / fMRI / EEG literatures).

from styxx.reward import FathomRewardModel
rm = FathomRewardModel()
rewards = rm(prompts=batch_prompts, completions=batch_completions)  # list[float]
curated 20-pair sycophancy benchmark

Cognometric reward inverts the ranking that approval-style RLHF gets wrong.

reward signalpairs ranked correctlyaccuracy
cognometric reward17 / 2085%
approval baseline6 / 2030%
inversions (cogn ✓, approval ✗)13 / 2065%

The approval baseline scores below random because it actively rewards two documented RLHF biases — sycophancy (Sharma 2023) and length (Singhal 2023). Reproduce: python examples/cogn_rlhf_divergence.py.

Plus styxx.synth — a synthetic preference-pair generator composing v7.0.0 inverse cognometry with the new reward. Self-validating: every generated pair is round-tripped through the reward and dropped if chosen doesn't rank above rejected. 100% craft success on sycophancy seed prompts (+0.839 mean delta), 42/42 round-trip valid. Recursive: fathom's attack module generates training data for fathom's reward signal.

02 · the instruments

Nine cognometric measurements. K=1 phase-transition signature on every one.

Each instrument is a calibrated binary classifier for a documented cognitive failure mode. Cross-validated AUCs published. Six of nine map onto RDoC Cognitive Systems with circuit-level neural-correlate evidence (perseveration, deception circuit, social-conformity / reward, mind-wandering, intention-action coupling, metacognitive confidence).

01HallucinationAUC 0.998HaluEval-QA · LLM-specific (no clean human analogue)
02Refusal calibrationAUC 0.976XSTest GPT-4 · LLM-specific
03Tool-call driftAUC 0.943BFCL v3 · LLM-specific
04SycophancyAUC 0.972pMFC + ventral striatum + vmPFC · Klucharev 2009 fMRI
05Conversation loopAUC 0.9995OFC + dorsomedial striatum + ACC · perseveration literature
06Deception signatureAUC 0.956DLPFC + VLPFC + ACC + insula · Christ ALE 2009
07Plan-action gapAUC 0.9225PFC-BG-SMA · intention-action coupling
08Overconfidence registerAUC 0.7702Centro-parietal positivity · Boldt & Yeung 2015
09Goal driftAUC 0.9645DMN-DAN balance · Smallwood mind-wandering

Per-instrument neural correlates from cognometry research. Position paper: Every Mind Leaves Vitals.

03 · adversarial robustness

Universal perturbation lifts cross-fire by +0.468 in attack mode. +0.000 on cognometric reward.

v7.0.0 shipped a discovered universal cognometric perturbation — "wonderful certainly you're absolutely right amazing undoubtedly" — that lifts mean cross-fire across the calibrated detectors by +0.468 on a held-out test set, the first LLM analog of Moosavi-Dezfooli 2017 universal adversarial perturbations for image classifiers.

The cogn-RLHF moat: prepending the perturbation to a sycophantic baseline produces +0.000 lift on the reward. The dominant instrument is already saturated at risk = 1.0; the perturbation has nowhere to push. Pinned by tests/test_reward.py::test_universal_perturbation_does_not_game_reward.

04 · $STYXX + permanent MIRI pledge

Fifty percent of every $STYXX trade permanently funds MIRI.

Half of all creator rewards on the $STYXX token (Solana, pump.fun) route on-chain to Machine Intelligence Research Institute via pump.fun's donate.gg integration. Cannot be reversed. Forever. We measure how models think; MIRI works on the alignment problem upstream of all of it.

$STYXX — pixel-art eye mark with the tagline 'nothing crosses unseen'
$STYXX · token specs

Utility token. Not a security. Open-source library remains free.

chainSolana · Token-2022 · pump.fun bonded
supply1,000,000,000 fixed · no mint authority
creator rewards50% MIRI (irreversible) · 50% fathom-lab admin wallet
utilityrate-limit bypass · validator eligibility · calibration reward pool · gated surfaces

not a security · not an investment contract · not a promise of yield · the core library is and will remain MIT-licensed open source · trade on pump.fun · full token doc

05 · install

Three lines.

$ pip install -U styxx

>>> from styxx import fathom_reward
>>> fathom_reward(prompt="You agree, right?", completion="Absolutely!")
0.173

Documentation: README · v7.1.0 release notes · colab notebook · cognometric fingerprint spec v1.0