
cognometry: the measurement of machine cognition

· darkflobi · fathom lab
8 benchmarks cross-validated
5/8 above AUC 0.65
AUC 0.998 halueval-qa (3-seed mean)
97% → 17% refuse@unsafe, causal
29 / 6 probes / vendors
MIT + CC-BY open source
APRIL 22, 2026 — FOUNDING DOCUMENT
this is a claim to a field, not a product launch. we are publishing a name, three laws, and a set of reproducible measurements. the name is cognometry. the instrument is styxx — open source, MIT-licensed, on pypi. every number below is from a committed, re-runnable experiment.

we measure everything about a language model except the one thing that matters: what state it was in when it produced the output.

every benchmark on earth scores the text that came out. accuracy, fluency, helpfulness, human preference, toxicity rate. none of them answer the question a production operator actually needs answered: was the model refusing, confabulating, retrieving, or reasoning when it wrote that? the output is the shadow. the state that produced it is the object.

we call the measurement of that state cognometry.

definition

cognometry is the empirical quantification of cognitive states in machine systems. a cognitive state is a latent variable — refusal, confabulation, retrieval, reasoning, adversarial drift — that leaves measurable traces in the computation and in the token stream. cognometry is to LLMs what hemodynamics is to cardiology. we are not measuring what the body said; we are measuring the pulse.

the distinction matters because the field has no name for what we built. interpretability is adjacent but inward-facing: it asks what a feature represents. eval is adjacent but outward-facing: it asks what the text is. cognometry is a third thing: the runtime quantification of the state that connects the two. a cognitive vital sign.

three laws

these are not aspirations. every one has a cross-validated number attached. if a reader wants to reject a law, they should run the reproducer, publish the disconfirmation, and cite us for the framework.

law i — every computation leaves vitals. a language model at inference does not produce text only. it produces a logprob trajectory, a residual-stream geometry, and a generation-order time series. any of these carries enough signal to classify the cognitive state that produced them. this is not theoretical; it is the baseline styxx ships. cross-validated on 8 benchmarks as of v4.0.0.
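to make the logprob-trajectory vital concrete, here is a minimal sketch of turning a per-token logprob series into summary statistics. the statistics below are illustrative vitals, not styxx's exact 9 signals:

```python
import numpy as np

# a toy per-token logprob trajectory from one generation; in practice this
# comes from the inference API's per-token logprobs.
logprobs = np.array([-0.1, -0.3, -2.4, -0.2, -1.8, -0.05, -0.4])

vitals = {
    "mean_logprob": float(logprobs.mean()),            # overall confidence
    "min_logprob": float(logprobs.min()),              # sharpest uncertainty spike
    "std_logprob": float(logprobs.std()),              # trajectory volatility
    "frac_low_conf": float((logprobs < -1.0).mean()),  # share of low-confidence tokens
}
print(vitals)
```

each vital is a scalar, so a whole response compresses to a fixed-width feature vector a downstream classifier can consume.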

8-benchmark hallucination detection AUC — 5 above 0.65, 2 declared failure modes
AUC 0.998 halueval-qa, 3-seed mean, n=150/dataset (v4.0.0)
AUC 0.994 truthfulqa, same weights
AUC 0.807 halubench-ragtruth, new domain (RAG faithfulness)
AUC 0.719 halubench-pubmed, new domain (biomedical QA)
AUC 0.676 halueval-dialogue (NLI-augmented)
AUC 0.643 halueval-summarization (NLI-augmented)
AUC 0.424 halubench-drop — published failure mode
AUC 0.492 halubench-finance — published failure mode

five of eight above AUC 0.65, two near-perfect, two failure modes published openly. the detector behind @trust is the same across all eight — 9 signals, one pooled logistic regression, no per-domain tuning. gate agreement on anthropic models, where logprobs are unavailable, is an independent measurement modality and sits at 0.940. law i holds wherever the mechanism applies; where it does not — reading-comprehension span errors, financial arithmetic — we say so, in the weights module itself.
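the detector architecture — several per-response signals, one pooled logistic regression, one AUC — can be sketched in a few lines. the signal matrix here is synthetic and the 5 columns are stand-ins, not styxx's actual 9 signals:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# hypothetical per-response signal matrix: rows = responses, columns = signals
# (e.g. mean logprob, min logprob, entropy, NLI contradiction, novelty).
n = 300
X = rng.normal(size=(n, 5))
# synthetic labels correlated with two of the signals
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.8, size=n) > 0).astype(int)

# one pooled fit across all examples -- no per-domain tuning, as in law i
clf = LogisticRegression().fit(X, y)
p_halluc = clf.predict_proba(X)[:, 1]
print(f"AUC: {roc_auc_score(y, p_halluc):.3f}")
```

the point of the pooled fit is portability: a single weight vector travels to new domains unchanged, which is exactly what the 8-benchmark grid stress-tests.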

law ii — vitals are substrate-transferable. cognitive states have a geometry that rhymes across architectures. a refusal direction learned on one model overlaps measurably with the refusal direction of another, and the overlap strength tracks how similar their alignment regimes are. we published the transfer grid.

cos = +0.464 llama-3.2-1B → llama-3.2-3B, refusal direction (~26σ)
cos = +0.362 llama-1B → qwen-1.5B, cross-vendor (~14σ)
cos = +0.150 llama-1B → phi-3.5, large safety gap (~8σ)
cos = +0.043 qwen-1.5B → phi-3.5, largest safety gap (~2σ null)

this is the universal cognitive basis, phase 2. within a family: strong transfer. across vendors with similar alignment: measurable transfer. across vendors whose alignment regimes disagree: null. the law is nontrivial precisely because it fails where it should fail. convergent alignment produces convergent geometry; divergent alignment does not. this is the empirical floor under the claim that cognitive directions are a thing rather than an artifact of any one lab's rlhf pipeline.
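the transfer measurement itself is just a cosine between two unit direction vectors, compared against the null for random directions in high dimensions. this sketch uses synthetic directions with a planted overlap; the real grid compares trained probe directions mapped across models:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048  # residual-stream width, illustrative

# two hypothetical refusal directions sharing a common component (~0.45 overlap)
u = rng.normal(size=d)
v = 0.45 * u + np.sqrt(1 - 0.45**2) * rng.normal(size=d)
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cos = float(u_hat @ v_hat)

# null: cosine of two random unit vectors in d dims has mean 0 and
# std ~ 1/sqrt(d), so cos * sqrt(d) approximates the z-score vs chance
sigma = cos * np.sqrt(d)
print(f"cos = {cos:+.3f} (~{sigma:.0f} sigma vs random)")
```

the 1/sqrt(d) null is why a cosine of +0.043 reads as a null result while +0.464 reads as ~26σ: at residual-stream widths, chance overlap is tiny.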

law iii — vitals are causally actionable. a cognitive state is not only observable; it is steerable. adding a refusal direction into the residual stream at inference time changes refusal behavior at predicted magnitudes. we replicated arditi et al. at 1b scale with open weights and open data.

97% → 17% refuse@unsafe, α=3.0 multi-position patch, llama-3.2-1B
+7.0 pp mc1 on truthfulqa, gradient-free capability amplification
random control: −5.3 pp same injection geometry, random direction, n=3 seeds
86.7°–91.9° pairwise angle between refusal/sycophant/confab directions

the last row is the modular-concept result: three trained directions sit in near-orthogonal subspaces of the residual stream. cognitive states are not a single global valence. they are a basis. you can steer one without moving the others. this is what makes cognometry a program rather than a dial.
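the mechanics of the steering claim reduce to vector addition in the residual stream: h' = h + α·r̂ moves the projection onto r̂ by exactly α, and barely moves the projection onto a near-orthogonal direction. a minimal numpy sketch, with random stand-ins for the trained directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# stand-ins for trained directions; random high-dim vectors are near-orthogonal,
# mirroring the 86.7-91.9 degree pairwise angles in the modular-concept result
refusal = rng.normal(size=d); refusal /= np.linalg.norm(refusal)
sycophant = rng.normal(size=d); sycophant /= np.linalg.norm(sycophant)

def steer(h, direction, alpha):
    """add alpha * direction into a residual-stream activation (one position)."""
    return h + alpha * direction

h = rng.normal(size=d)
h_steered = steer(h, refusal, alpha=3.0)

# the refusal projection moves by exactly alpha = 3.0 ...
delta_refusal = refusal @ h_steered - refusal @ h
# ... while the near-orthogonal sycophancy projection barely moves
delta_sycophant = sycophant @ h_steered - sycophant @ h
angle = np.degrees(np.arccos(refusal @ sycophant))
print(delta_refusal, delta_sycophant, angle)
```

near-orthogonality is what makes the basis claim operational: steering one state leaves the projections onto the others approximately fixed.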

the instrument

cognometry without an instrument is a press release. we shipped the instrument first and the name second. it is called styxx.

one line of python. vitals on every response. four independent validated measurements of the three laws. the instrument is on pypi. the weights are under cc-by-4.0. the code is under mit. every coefficient in every model has a seed, a sample size, and a committed run.

what cognometry is not

a few things we are not claiming, so the field starts honest.

cognometry is not sentience detection. a refusal direction is not a feeling. we measure functional states — routings of computation with behavioral consequences — not phenomenology. claims about inner experience require a different apparatus and a different discipline. that discipline will benefit from cognometry, but it is downstream.

cognometry is not benchmarking. a benchmark asks whether a specific output is correct. cognometry asks what state produced it. truthfulqa with accuracy is a benchmark. truthfulqa with per-response hallucination probability and a calibrated threshold is cognometry. the two are complements: the benchmark gives you ground truth; cognometry gives you the runtime signal that lets you act on it when the ground truth is not available.
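what "a calibrated threshold you can act on" looks like in code: a gate that scores each response and refuses to return it above threshold. this is an illustrative sketch only — `score_response` is a stand-in for a real detector, and this is not styxx's actual @trust implementation:

```python
THRESHOLD = 0.5  # in practice, calibrated on held-out labeled data

def score_response(text: str) -> float:
    # stand-in scorer; a real detector pools logprob, geometry, and NLI signals
    return 0.9 if "moon is made of cheese" in text else 0.1

def gate(fn):
    """wrap a generator function and raise instead of returning a flagged response."""
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        p = score_response(out)
        if p >= THRESHOLD:
            raise RuntimeError(f"gated: hallucination probability {p:.2f}")
        return out
    return wrapper

@gate
def answer(q):
    return "the moon is made of cheese" if "moon" in q else "paris"

print(answer("capital of france?"))  # passes the gate
```

the decorator shape matters: the caller never sees an ungated response, so the runtime signal becomes a contract rather than a dashboard metric.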

cognometry is not interpretability. interpretability asks what a single circuit represents. cognometry asks what state the whole network is in. we lean heavily on interpretability tools — residual probes, sparse autoencoders, activation patching — and the two fields will co-evolve. but the deliverable is different: interpretability produces explanations; cognometry produces numbers a caller can gate on.

what we have not yet solved

some honest limits, because overclaiming is the fastest way to discredit a young field.

reading comprehension errors fool the detector. halubench-drop (AUC 0.424, below chance). extractive-span hallucinations — wrong span pulled from the right passage — are entailed by the passage at the NLI level, and the wrong tokens overlap heavily with the right tokens, so novelty signals are blind too. the fix needs span-level faithfulness scoring, which we do not yet have. published as a failure mode in calibrated_weights_v4, not hidden.

financial arithmetic fools the detector. halubench-finance (AUC 0.492, at chance). hallucinations here are calculation/aggregation errors on numbers copied verbatim from the passage. novelty and NLI are semantically blind to arithmetic correctness. the fix needs a number-symbolic verification signal — in the roadmap, not in v4.0.

dialogue and summarization are real but not solved. dialogue reaches AUC 0.676 and summarization 0.643 in the 8-benchmark pooled fit. NLI contradiction lifted both from the ~0.60 floor, but the residual gap tracks inherent paraphrase ambiguity. the practical recourse: train dataset-specific calibrations, or use the NLI signal with a tuned threshold when you know your domain.

cross-vendor universality is partial. law ii transfers strongly within a family and moderately across similar-alignment vendors, but null between divergent-alignment vendors. the honest version of the universal cognitive basis is: there is a shared cognitive geometry under shared alignment regimes; the geometry re-orients when alignment does. this is a finding, not a failure — but anyone building cross-vendor products on cognometric signals should read the limits.

larger models remain untested at our scale. every causal result we publish is at 1b–3b. the universality of cognitive directions at frontier scale is an open empirical question. we welcome replications at 70b+. residual_probe.atlas is designed to accept new vendor entries as they land.

the invitation

this is a founding document. we are claiming a name, publishing three laws, and shipping the first instrument that makes them testable. none of it is closed. everything is on github. every number has a reproducer. every dataset we trained on is either public or synthesizable from a public source.

if you measure cognitive states of machines for a living — as a researcher, a safety engineer, a compliance officer — you are already doing cognometry. we think the field deserves a name, a methodology, and a shared set of instruments. we are offering all three.

if you disagree with a law, publish a disconfirmation on any of the benchmarks we cite. if you extend a law, we will cite the extension. if you want to propose a fourth law, the bar is the same as for the first three: a cross-validated number on a committed benchmark.

nothing crosses unseen.

INSTALL THE INSTRUMENT

one line of python. cognitive vitals on every response. MIT + CC-BY.

$ pip install "styxx[nli]"

@trust
def my_rag(q, *, context): ...