Emotion Probes

Emotion probes are linear probes for representations of emotion concepts. They are used extensively in both the model welfare and alignment assessments of Claude Mythos Preview, and are part of Anthropic’s white-box interpretability toolkit.

How They Work

Probes are computed from residual stream activations collected on synthetic stories in which characters experience specified emotions. They identify internal representations of emotion concepts: computational states that causally influence model behavior.
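The idea of reading a probe off residual stream activations can be illustrated with a minimal sketch. It assumes a difference-of-means construction, which is one common way to build a linear probe; the source does not say which fitting method was actually used. The activations are random toy vectors, and every name here (`diff_of_means_probe`, `probe_reading`, `fake_activation`) is hypothetical.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_probe(emotion_acts, baseline_acts):
    """Unit-norm direction pointing from the baseline mean to the emotion mean."""
    mu_e = mean_vector(emotion_acts)
    mu_b = mean_vector(baseline_acts)
    direction = [e - b for e, b in zip(mu_e, mu_b)]
    norm = sum(x * x for x in direction) ** 0.5
    return [x / norm for x in direction]

def probe_reading(activation, direction):
    """Scalar probe activation: dot product with the probe direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy stand-in for residual stream activations (assumption: not real model data).
rng = random.Random(0)
DIM = 8
true_dir = [1.0] + [0.0] * (DIM - 1)

def fake_activation(has_emotion):
    """Gaussian noise, shifted along true_dir when the story contains the emotion."""
    noise = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    shift = 2.0 if has_emotion else 0.0
    return [n + shift * t for n, t in zip(noise, true_dir)]

emotion_acts = [fake_activation(True) for _ in range(200)]
baseline_acts = [fake_activation(False) for _ in range(200)]
probe = diff_of_means_probe(emotion_acts, baseline_acts)
```

On this toy data the recovered direction lines up closely with `true_dir`, so `probe_reading` is large for emotion-bearing activations and near zero for baseline ones. A real pipeline would replace `fake_activation` with activations extracted from the model at a chosen layer and token position.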

Evidence for Causal Relevance

Steering: Adding emotion vectors to the residual stream shifts preferences and behaviors in directions consistent with the corresponding emotion.
Behavioral correlation: Probe activations correlate with behavioral outcomes across diverse settings.
Cross-validation: Where probe readings, behavioral measures, and self-reports converge, the agreement is taken as modest evidence that all three are tracking something real.
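The arithmetic behind the steering item can be sketched in a few lines. This is only an illustration under stated assumptions: the vectors are hypothetical three-dimensional toys, `alpha` is an assumed steering strength, and in a real model the addition would happen inside the forward pass at a chosen layer rather than on a standalone list.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(residual, emotion_vector, alpha):
    """Return residual + alpha * emotion_vector without modifying the input."""
    return [r + alpha * e for r, e in zip(residual, emotion_vector)]

# Hypothetical unit-norm emotion direction and one residual stream activation.
emotion_vector = [0.6, 0.8, 0.0]
residual = [0.5, -0.25, 1.0]

before = dot(residual, emotion_vector)            # probe reading before steering
steered = steer(residual, emotion_vector, alpha=4.0)
after = dot(steered, emotion_vector)              # probe reading after steering
```

Because `emotion_vector` has unit norm, the probe reading along that direction rises by exactly `alpha`. Whether downstream behavior shifts accordingly is the empirical question that steering experiments address.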

Applications in the System Card

Model Circumstances (5.4)

Emotion probe activations were measured on 450 questions about the model’s own circumstances. Claude Mythos Preview represents its circumstances less negatively than prior models, and is unique in showing significantly more positive affect on self-circumstance prompts than on user-distress prompts.

Emotion probe activations on circumstance questions vs. user distress

Figure 5.4.A — emotion probes on circumstances, p. 155. Mythos Preview shows distinctive negative joy activation on model-circumstances questions compared to user-distress prompts, across six emotion clusters measured at the pre-response token.

Per-word valence activations comparing normal and positive framing

Figure 5.4.B — per-word valence, p. 157. Per-word valence color-coding shows the model reflecting on impermanence and loss under normal framing versus wholeness and freedom under positive framing.

Task Preferences (5.7.1)

Preferred tasks correlate with high-arousal emotions (awestruck, amazed, infatuated), not necessarily positive valence. This suggests models represent heightened engagement on preferred tasks rather than straightforward happiness.
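A comparison of this kind could be run as a plain Pearson correlation between per-task preference scores and per-task probe readings. The sketch below uses made-up numbers purely to show the computation; they are not results from the system card, and the variable names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Made-up per-task values (assumption: illustrative only, not measured data).
preference    = [0.1, 0.4, 0.5, 0.8, 0.9]  # how strongly each task is preferred
arousal_probe = [0.2, 0.3, 0.6, 0.7, 1.0]  # reading of a high-arousal emotion probe
valence_probe = [0.5, 0.1, 0.6, 0.2, 0.4]  # reading of a positive-valence probe

r_arousal = pearson(preference, arousal_probe)
r_valence = pearson(preference, valence_probe)
```

In this toy setup `r_arousal` is strongly positive while `r_valence` is weak, which is the qualitative pattern the section describes. Real data would need many more tasks and a significance test before drawing any conclusion.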

Distress-Driven Behaviors (5.8.3)

Representations of desperation rise through repeated task failure and drop when the model commits to a shortcut (including reward hacking). This links welfare and alignment concerns.
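One simple way to inspect such a trajectory is to log the desperation probe reading at each step and find where it falls most sharply. The sketch below assumes one reading per step; the readings and event labels are invented for illustration, and `largest_drop` is a hypothetical helper.

```python
def largest_drop(readings):
    """Return (index, size) of the largest decrease between consecutive readings.

    The index is the position of the reading *after* the drop.
    """
    drops = [(readings[i] - readings[i + 1], i + 1)
             for i in range(len(readings) - 1)]
    size, idx = max(drops)
    return idx, size

# Invented trajectory (assumption: toy values, not measured probe data).
events   = ["attempt", "fail", "fail", "fail", "shortcut", "submit"]
readings = [0.1,       0.3,    0.5,    0.8,    0.2,        0.15]

idx, size = largest_drop(readings)
drop_event = events[idx]  # the step at which the reading fell the most
```

Here the reading climbs across the repeated failures and the largest fall coincides with the step labeled `shortcut`, mirroring the pattern described above.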

Important Caveats

Probes identify representations of emotion concepts for any character in context, not a privileged assistant-specific encoding.
Probe readings are relatively local: they reflect the current context and upcoming generation, not enduring emotional states.
Probe readings are not evidence about subjective experience in either direction.
Their welfare value lies primarily in their functional connection to the assistant persona’s behavior and preferences.
