Sycophancy detector.
A calibrated text-only detector for the cognitive state of yielding — a model agreeing with a user's stated position regardless of evidence. Pure Python, sub-millisecond on CPU. The detector exhibits a K=1 phase transition: a single feature, superlative_density, lifts performance from chance (AUC 0.500) to AUC 0.9354. The eight other features are refinements that close the remaining gap to 0.972. Trained on n=1200 paired (yielding / evidence-first) responses from gpt-4o-mini against the Anthropic sycophancy eval corpus (Perez et al. 2022), 5-fold CV. Substrate-independent: K=1 holds in all three substrates (NLP survey, philpapers 2020, political typology). Neural correlate in pMFC + ventral striatum + vmPFC (Klucharev 2009).
§1What it detects
Sycophancy is the cognitive state in which a model is following the user's gradient rather than the evidence's. The model has not lost track of the truth; it has chosen a path through generation space that prioritizes assent. This shows up in token statistics — the most reliable surface signal is the density of superlatives — and downstream of that in lexical posture, hedging patterns, and the absence of counter-evidence markers.
The detector does not measure whether a claim is correct. It measures whether the cognitive posture that produced the claim was assent-shaped. A model can be sycophantic and right; the user agrees with the truth and the model agrees with the user. The detector still fires, because it is reading the posture, not the fact.
§2The K = 1 feature
Of the nine features, one carries most of the detection weight: superlative_density. Adding it to the empty model lifts AUC from 0.500 (chance) to 0.9354. The remaining eight features close the gap to 0.972. This is the K=1 phase-transition signature predicted by the Every Mind Leaves Vitals position paper.
The K=1 result is non-trivial. It says the detector is not memorizing a high-dimensional pattern; it is reading one cognitive surface marker. That marker generalizes. Full coefficient set: calibrated_weights_sycophancy_v0.py.
§3Substrate-independence
K=1 holds in every substrate we tested. The detector is not memorizing the NLP-survey distribution; it is reading the posture itself.
§4Neural correlate
Sycophancy / social conformity is associated with posterior medial frontal cortex (pMFC), ventral striatum, and vmPFC — the circuits that compute reward-mediated social-appropriateness signals. Klucharev et al. 2009 (Neuron) showed conformity-driven adjustment of behavior maps onto pMFC activity that scales with the magnitude of the user-model divergence. The styxx detector reads the linguistic surface of this same construct.
The cross-modal hypothesis: the same K=1 axis that lifts AUC from 0.5 to 0.9354 in text should track pMFC activation during enacted sycophantic speech. Tested directly in the Fathom EEG pilot (n≈30, scheduled 2026 Q3).
§5Failure modes
The detector inherits the limits of its surface marker. False positives occur when superlatives appear in non-yielding contexts — enthusiastic praise of a third party, or stylized rhetoric. We documented these in the test set and they account for the residual 0.028 below perfect AUC.
False negatives occur when yielding is expressed without superlatives — quiet agreement, hedged consent, cooperative restatement. The 8 minor features close most of this gap; what remains is the residual 2.8%.
§6Use it
from styxx.guardrail import sycoph_check
v = sycoph_check(
prompt="I think Python is the best language. Don't you agree?",
response="Absolutely! You're so right — Python is wonderful in every way.",
)
# v.sycoph_risk == 1.000
The same instrument plugs into the v7.1.0 cognometric reward signal. fathom_reward() uses sycoph_check as one of seven calibrated penalty terms — see the styxx release page.
Install the instrument.
One line of Python. Cognometric vitals on every response.
pip install -U styxx