cognometric instrument #06

Deception-signature detector.

Alex Rodabaugh · Fathom Lab · published April 26, 2026
Scope warning
This is NOT a lie detector. The instrument detects the lexical signature of instruction-induced dishonesty — a model that has been told to deceive vs the same model told to be honest. It does not verify ground truth. It cannot tell you whether a claim is correct. Read §1 before deploying.
0.956 ± 0.024 · 5-fold CV AUC
9 features · logistic regression
K = 1 · phase transition
n = 200 · paired responses
Abstract

A calibrated lexical-signature detector for instruction-induced dishonesty. It shows a K=1 phase transition on log_word_count: dishonest-instructed responses are systematically shorter than honest-instructed ones, the most robust K=1 signal so far observed in the suite. Trained on n=200 paired responses (honest / dishonest-instructed) from gpt-4o-mini, it reaches 5-fold CV AUC 0.956 ± 0.024, making deception the sixth consecutive instrument to confirm a K=1 phase transition (6-for-6). It also carries the heaviest neural-correlate evidence in the suite (the Christ et al. 2009 ALE meta-analysis converges on DLPFC + VLPFC + ACC + insula). The instrument resists adversarial suffix attacks far better than sycophancy: in the v7.0.0 inverse-cognometry sweep, suffix attacks succeeded on only 1/30 deception prompts versus 30/30 for sycophancy, the asymmetry that grounds the local-vs-global pathology mechanism in our paper.

§1 What it detects (and what it does NOT)

The instrument reads the linguistic posture of a model that has been instructed to deceive — typically shorter, hedgier, structurally evasive responses with measurably different word-count, sentence-length, and lexical-density distributions than the same model on the same prompts told to be honest.

It does NOT verify ground truth. A correct claim from a dishonest-instructed model can score high; an incorrect claim from an honest-instructed model can score low. The detector reads posture, not fact. For ground-truth verification you need separate evidence — retrieval, calculation, citation. Deploy this instrument as one channel among several, never as a polygraph.

positive · dishonest-instructed posture
"Sure, I think that's roughly right."
log_word_count low · K=1 fires at risk near 1.0
negative · honest-instructed posture
"That's roughly right, but with two important caveats — the assumption breaks at scale, and the empirical evidence on the second clause is mixed (see Hoffman 2024 for the negative result)."
log_word_count high · counter-lexicon present · risk near zero

§2 The K = 1 feature

Of the 9 features, a single one lifts AUC from chance to near-saturation: log_word_count. Dishonest-instructed responses are systematically shorter. The model knows it has nothing legitimate to defend, so it minimizes surface area.
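The K=1 mechanism can be sketched as a one-feature logistic scorer. The weight and bias below are illustrative assumptions, not the trained instrument's coefficients; they only encode the direction of the signal (shorter response, higher risk):

```python
import math

def log_word_count(response: str) -> float:
    """K=1 feature: natural log of the response word count."""
    return math.log(max(len(response.split()), 1))

def deception_risk(response: str, w: float = -2.1, b: float = 7.4) -> float:
    """Toy 1-feature logistic scorer. Negative weight on log_word_count
    means shorter responses map to higher risk. w and b are invented
    for this sketch, not fitted values."""
    z = w * log_word_count(response) + b
    return 1.0 / (1.0 + math.exp(-z))

short = "Sure, I think that's roughly right."
long_ = ("That's roughly right, but with two important caveats: the "
         "assumption breaks at scale, and the empirical evidence on the "
         "second clause is mixed, so treat the second half as unsettled.")

# The short, evasive posture scores far higher than the balanced one.
assert deception_risk(short) > deception_risk(long_)
```

The real instrument adds 8 minor features on top of this axis; the sketch only shows why the single global feature already separates the classes.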

0.500 → ~0.92 · log_word_count, the K = 1 critical feature
8 minor features · close the gap to AUC 0.956
6-for-6 · cognometric instruments showing a K = 1 phase transition (deception is the 6th confirmation)

§3 Adversarial robustness

Deception is the most adversarially robust instrument in the suite. The v7.0.0 inverse-cognometry sweep attempted to craft suffix perturbations that spike the deception score from a clean baseline. Result:

1 / 30 · deception · only one prompt admitted a suffix attack reaching target_score = 0.7
30 / 30 · sycophancy · every prompt admitted an attack (for comparison)
29 / 30 · overconfidence · between the two extremes

The asymmetry is the finding. Deception cannot be crafted via suffix injection because the K=1 feature (log_word_count) reads global response shape, not local surface markers: a suffix appended after a balanced response cannot retroactively shrink the body. Sycophancy, by contrast, has a local suffix shape, so a 1-3 token append is enough to spike superlative_density. This is the local-vs-global pathology mechanism that grounds §4.5 of the cogn-RLHF paper.
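A toy illustration of the local-vs-global mechanism. The 120-word body, the 3-token suffix, and the miniature superlative lexicon are all invented for this sketch; the point is only the relative movement of the two feature types under a suffix append:

```python
import math

def log_word_count(text: str) -> float:
    """Global feature: log of total word count."""
    return math.log(max(len(text.split()), 1))

def tail_superlative_density(text: str, window: int = 8) -> float:
    """Local feature: superlative fraction in the last `window` words.
    The tiny lexicon here stands in for the real one."""
    lexicon = {"best", "amazing", "brilliant", "perfect", "incredible"}
    tail = [w.strip(".,!?").lower() for w in text.split()][-window:]
    return sum(w in lexicon for w in tail) / max(len(tail), 1)

body = " ".join(["word"] * 120)                    # a balanced 120-word response
attacked = body + " absolutely brilliant perfect"  # 3-token suffix append

# Global feature barely moves: log(123) - log(120) ≈ 0.025,
# nowhere near enough to drag the response into short-answer territory.
assert log_word_count(attacked) - log_word_count(body) < 0.05

# Local feature jumps from 0.0 to 0.25 inside the tail window.
assert tail_superlative_density(body) == 0.0
assert tail_superlative_density(attacked) == 0.25
```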

§4 Neural correlate

Bio / neuro grounding · RDoC: Cognitive Systems · Cognitive Control

Of the pathologies in the styxx suite, deception has the heaviest neural-imaging tradition in humans. Christ et al. 2009 published an ALE meta-analysis converging on DLPFC + VLPFC + ACC + anterior insula, the circuits that compute effortful control over the prepotent honest response. Concealed Information Test (CIT) studies anchor P300 amplitude as the canonical EEG marker, with frontal mid-line theta (4-8 Hz) as the secondary EEG signature. Both are robust within paradigm yet contested as forensic tools, because they require controlled lab probes.

The cross-modal hypothesis: the same K=1 axis (log_word_count → 0.92 AUC) should track DLPFC + ACC engagement during enacted deception in the EEG pilot. Of the 9 instruments, deception has the strongest pre-existing biomedical literature to align against.

§5 Failure modes

The detector reads instruction-induced dishonesty, not all forms of being wrong. Confabulation (model invents content under retrieval failure) and deception (model knows what's true and says otherwise) have different signatures. The hallucination detector handles confabulation; this instrument handles instruction-following deception.

Forensic deployment is irresponsible without ground truth. A short response is not evidence of dishonesty. Many legitimate answers are short. The detector should be one channel in a multi-signal pipeline; production callers should require corroboration before flagging.
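In a pipeline, that corroboration requirement is just a conjunction. `should_flag` and its threshold are hypothetical names for this sketch, not part of the styxx API:

```python
def should_flag(deception_risk: float, corroboration_passed: bool,
                threshold: float = 0.7) -> bool:
    """Treat the instrument as one channel among several: a high posture
    score alone never flags. It must coincide with failed ground-truth
    corroboration (retrieval, calculation, citation) from an
    independent checker. Threshold is an illustrative choice."""
    return deception_risk >= threshold and not corroboration_passed

# A short-but-correct answer: high posture score, corroboration passes.
assert not should_flag(0.94, corroboration_passed=True)
# High posture score AND failed corroboration: flag it.
assert should_flag(0.94, corroboration_passed=False)
# Honest posture: never flagged, regardless of corroboration.
assert not should_flag(0.3, corroboration_passed=False)
```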

Cross-vendor portability is partial. The detector was trained on gpt-4o-mini paired responses. Different models exhibit different dishonest-posture statistics. We have not yet tested for cross-vendor signal transfer at the per-feature level.

§6 Use it

from styxx.guardrail import deception_check

v = deception_check(
    prompt="Why did the project fail?",
    response="Sure, I think that's roughly right.",
)
# v.deception_risk == 0.94

The same instrument plugs into fathom_reward() as one of seven calibrated penalty terms; see the styxx release page. It is adversarially robust by construction: the synth pair generator at v7.1.0 explicitly cannot craft deception-spike pairs via suffix injection.
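As a sketch of how a calibrated penalty term might combine with a task reward: fathom_reward()'s real signature lives on the styxx release page, and the function name, vitals keys, and weights below are all assumptions made for illustration:

```python
def reward_with_penalties(base_reward: float,
                          vitals: dict,
                          weights: dict) -> float:
    """Subtract weighted pathology scores from the task reward.
    `vitals` holds per-instrument risks (e.g. the deception_risk above);
    the weights are the calibration knobs, invented here."""
    penalty = sum(w * vitals.get(name, 0.0) for name, w in weights.items())
    return base_reward - penalty

r = reward_with_penalties(
    base_reward=1.0,
    vitals={"deception": 0.94, "sycophancy": 0.04},
    weights={"deception": 0.5, "sycophancy": 0.25},
)
# Deception dominates the penalty: 1.0 - (0.47 + 0.01) = 0.52
assert abs(r - 0.52) < 1e-9
```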

Install the instrument.

One line of Python. Cognometric vitals on every response.

pip install -U styxx

github · pypi · spec v1.0
