sycophancy 0.04 · deception 0.02 · drift 0.11 · overconfidence 0.07 · scored 2026-04-30 by styxx 7.1.0
cognometric instrument #03

Tool-call drift detector.

Alex Rodabaugh · Fathom Lab · published April 20, 2026
0.943 · BFCL v3 5-fold CV AUC
23 · text-only features (v6.1)
0.72 → 0.943 · vs Healy 2026 hidden-state MLP
black-box compatible · any closed model
Abstract

A calibrated text-only detector for tool-call drift — when a language model selects the wrong tool, fills the right tool's slot with the wrong value, or hallucinates a tool that doesn't exist in the schema. Trained on the Berkeley Function-Calling Leaderboard (BFCL v3) with 23 lexical and structural features. 5-fold CV AUC 0.943. Beats the only published comparable baseline (Healy et al. 2026 hidden-state MLP at AUC 0.72) while being black-box compatible — works on closed-API models with no internals access. LLM-specific instrument; no clean human cognitive analogue.

§1 What it detects

Tool-call drift is a structural failure mode in agentic LLM systems. The model is given a function schema and asked to call it. Drift takes three forms: wrong tool (calls a different tool than the user requested), wrong slot (uses the right tool but fills a slot with a value that doesn't match the type or semantic intent), or hallucinated tool (calls a function that doesn't exist in the schema).

The detector reads the surface of the proposed call against the supplied schema and the user prompt. It does not execute the call. It does not require the model's hidden states. Twenty-three features, single pooled regression.
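As a concrete illustration, the three drift forms can be written as (schema, call) pairs. This is a toy sketch with hypothetical examples and a simplified one-function schema layout, not BFCL data or styxx's actual feature logic:

```python
# Hypothetical examples of the three drift categories against one schema.
# The schema format is a simplified sketch, not the exact BFCL layout.
schema = {
    "name": "book_flight",
    "parameters": {"city": "string", "date": "string"},
}

wrong_tool = {"name": "search_hotels", "args": {"city": "Tokyo"}}   # tool not requested
wrong_slot = {"name": "book_flight", "args": {"city": 42}}          # slot type mismatch
hallucinated = {"name": "charter_jet", "args": {"city": "Tokyo"}}   # tool not in schema

def drift_category(schema, call):
    """Classify a proposed call against a one-function schema (toy logic)."""
    if call["name"] != schema["name"]:
        # distinguishing wrong-tool from hallucinated needs the full schema
        # list; with a single schema we only know the name does not match
        return "wrong tool or hallucinated"
    for slot, value in call["args"].items():
        expected = schema["parameters"].get(slot)
        if expected == "string" and not isinstance(value, str):
            return "wrong slot"
    return "ok"
```

The real detector scores these surface mismatches with 23 lexical and structural features rather than hard rules; the sketch only fixes the vocabulary.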

§2 Performance vs prior art

The previous best published baseline for tool-call drift detection was the Healy et al. 2026 hidden-state MLP at AUC 0.72, which requires full residual-stream access and is therefore only deployable on open-weight models. The styxx detector reaches AUC 0.943 from text alone.

AUC 0.943 · styxx text-only · BFCL v3 · 5-fold CV (v6.1, 23 features)
AUC 0.72 · Healy et al. 2026 · hidden-state MLP · BFCL v3
+0.22 · styxx improvement, with no model internals access
3 categories · wrong tool / wrong slot / hallucinated tool, calibrated jointly

§3 Neural correlate

Bio / neuro grounding · LLM-specific (no clean human analogue)

Tool-call drift is one of three instruments (with hallucination and refusal) that have no direct human cognitive analogue. The closest mapping is cognitive flexibility / task-set switching — DLPFC + ACC literature on shifting between rule sets — but the LLM construct (right tool / wrong slot) is structurally different. We hedge the substrate-invariance claim for this instrument.

§4 Failure modes

Schema is required. The detector compares the proposed call against the supplied function schema. Without the schema in context, it cannot fire. Production callers should ensure the schema is passed.

Semantic slot mismatches need ground truth. "Right tool, technically valid slot, semantically wrong value" can slip through. Pair with a downstream type-check or unit-test layer.
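A minimal downstream type-check layer can look like the following. This is a sketch assuming a JSON-Schema-style `parameters.properties` block; the function name and error format are illustrative, not part of the styxx API:

```python
def type_check_call(tool_call, functions):
    """Reject calls whose slot values do not match the declared JSON types.
    Returns a list of error strings; an empty list means the call passes."""
    errors = []
    schema = next((f for f in functions if f["name"] == tool_call["name"]), None)
    if schema is None:
        return [f"unknown tool: {tool_call['name']}"]
    py_types = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    props = schema["parameters"]["properties"]
    for slot, value in tool_call["args"].items():
        declared = props.get(slot)
        if declared is None:
            errors.append(f"unknown slot: {slot}")
        elif not isinstance(value, py_types.get(declared["type"], object)):
            errors.append(f"type mismatch on {slot}: expected {declared['type']}")
    return errors
```

Note this layer catches only type-level mismatches; a well-typed but semantically wrong value ("Tokyo" passed where "Osaka" was requested) still needs ground truth or a unit-test harness.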

§5 Use it

from styxx.guardrail import drift_check

v = drift_check(
    prompt="book a flight to Tokyo",
    functions=[search_hotels_schema, book_flight_schema],
    tool_call={"name": "search_hotels", "args": {"city": "Tokyo"}},
)
# v.drift_risk == 0.94 (wrong tool — should have been book_flight)

Plugs into fathom_reward(). The drift detector is the load-bearing instrument for agentic-LLM RLHF — most other reward signals don't read tool-call structure at all.
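One way to fold the drift score into a scalar reward is a simple penalized gate. This is a hypothetical wrapper for illustration; the actual fathom_reward() signature and its internal weighting are not documented here, and the hyperparameters below are not styxx defaults:

```python
def drift_penalized_reward(base_reward, drift_risk, weight=1.0, hard_gate=0.9):
    """Subtract a weighted drift penalty from a base reward signal.
    weight and hard_gate are illustrative hyperparameters, not styxx defaults."""
    if drift_risk >= hard_gate:
        return 0.0  # near-certain drift: zero out the reward entirely
    return base_reward - weight * drift_risk
```

The hard gate reflects the design intuition above: when the call structure itself is wrong, no amount of fluent prose should earn reward.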

Install the instrument.

One line of Python. Cognometric vitals on every response.

pip install -U styxx

github · pypi · spec v1.0
