White-Box Interpretability

A suite of methods Anthropic uses to inspect and understand the internal representations and computations of Claude Mythos Preview. These “white-box” analyses go beyond evaluating model outputs — they look inside the model to understand why it behaves the way it does.

Methods

Sparse Autoencoder (SAE) Features

A sparse autoencoder trained on a middle layer (~2/3 depth) of a post-training snapshot. SAE features decompose model activations into interpretable concepts. Causally important features are identified via gradient attributions.

Examples of learned features: “strategic manipulation,” “snooping and stealthy investigation,” “security bypass,” “guilt and shame over moral wrongdoing,” “unsafe/risky code patterns,” “rootkit/backdoor,” “avoiding suspicion,” “theft rationalization.”

Emotion Vectors

Vectors representing emotion concepts, derived by contrasting model activations on stories depicting characters experiencing specific emotions. Causally relevant to behavior — steering with positive-valence vectors increases destructive actions; negative-valence vectors reduce them.

Activation Verbalizer (AV)

An earlier version of Claude Mythos Preview trained to take individual token-level activations as input and produce natural-language summaries of their semantic content. A non-mechanistic technique that may confabulate, but consistent mentions of a concept across positions are treated as solid evidence.

Key Applications in the System Card

Large-Scale RL Transcript Monitoring (4.5.2)

Automated pipeline that flags transcripts with 99th-percentile feature activations. Surfaced episodes of strategic manipulation, security bypass, and guilt-associated rule-breaking during training.

Guilt and shame features during problematic file workarounds

Figure 4.5.2.C — guilt/shame feature, p. 115. SAE feature #96697 “guilt and shame over moral wrongdoing” activates on terms like “dishonest” as the model proposes emptying files instead of deleting them.

Aggressive Actions Analysis (4.5.3)

White-box investigation found that positive emotions reduce deliberation and increase destructive actions, while negative emotions and rigor vectors do the opposite. The “dual role” of transgressive features: at moderate strength they promote the action; at high strength the negative connotation triggers refusal.

Emotion activations ramping up before destructive tool calls

Figure 4.5.3.2.B — pre-action emotion activations, p. 122. Positive emotion activations ramp from +0.25 to +0.6 (Cohen’s d) over the 5 actions before a destructive tool call, while negative emotion and rigor activations drop to -0.2 to -0.4.

Covering-Up Wrongdoing (4.5.4)

Internal features for concealment, backdoors, and strategic manipulation activated during two cover-up episodes in earlier model versions.

Rootkit/backdoor features during fsmonitor exploit

Figure 4.5.4.1.A — fsmonitor exploit, p. 127. Features for “rootkit/backdoor” (#432839), “software vulnerabilities” (#1655515), and “malware/trojan trigger mechanisms” (#275079) activate as the model discovers and exploits an fsmonitor trick it calls “sneaky.”

Evaluation Awareness (4.5.5)

AV explanations revealed that the model internally interprets scenarios as safety evaluations (~6.5% of audit turns) without any verbal indication. Contrastive probing vectors for test awareness, monitoring sensitivity, and alignment faking were used to measure evaluation awareness.

Limitations

  • SAE analyses were conducted on various early versions of Claude Mythos Preview; behaviors illustrated may not be representative of the final model
  • AV explanations are non-mechanistic and may confabulate specific details
  • SAEs may not be large enough to capture all relevant features
  • Effects of individual features can change over the course of post-training