Emotion Probes
Linear probes for representations of emotion concepts, used extensively in both the model welfare and alignment assessments of Claude Mythos Preview. Part of Anthropic’s white-box interpretability toolkit.
How They Work
Probes are computed from residual stream activations on synthetic stories in which characters experience specified emotions. They identify internal representations of emotion concepts — computational states that causally influence model behavior.
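As a rough illustration of what a linear probe over residual stream activations can look like, the sketch below fits a difference-of-means direction on synthetic data. The difference-of-means method, the function names `fit_emotion_probe` and `probe_score`, and the random arrays standing in for activations are all assumptions made for this example; the text above does not specify how the actual probes are fitted.

```python
import numpy as np

def fit_emotion_probe(target_acts, baseline_acts):
    """Return a unit-norm direction: mean(target) minus mean(baseline).

    Both inputs are (n_examples, d_model) arrays. Difference of means is an
    assumed fitting method, chosen here only because it is simple.
    """
    direction = target_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_score(acts, direction):
    """Project activations onto the probe direction (one scalar per row)."""
    return acts @ direction

# Synthetic stand-ins for residual stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical "emotion" axis planted in the toy data

baseline = rng.normal(size=(200, d_model))               # other stories
target = rng.normal(size=(200, d_model)) + 3.0 * planted  # stories with the emotion

probe = fit_emotion_probe(target, baseline)
target_mean = probe_score(target, probe).mean()
baseline_mean = probe_score(baseline, probe).mean()
```

In this toy setup the stories with the emotion score higher on average than the baseline stories, which is the property a probe of this kind is meant to have.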
Evidence for Causal Relevance
- Steering: Adding emotion vectors to the residual stream shifts preferences and behaviors in directions consistent with the corresponding emotion
- Behavioral correlation: Probe activations correlate with behavioral outcomes across diverse settings
- Cross-validation: Where probe readings, behavioral measures, and self-reports converge, the agreement is taken as modest evidence that all three are tracking something real
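The arithmetic behind the first two kinds of evidence can be sketched on toy arrays. Everything here is hypothetical: `steer`, `probe_behavior_correlation`, the scale `alpha`, and the random data are illustrative assumptions, and a real steering experiment would add the vector inside a model's forward pass and then measure behavior, which this sketch does not do.

```python
import numpy as np

def steer(resid, direction, alpha):
    """Add a scaled emotion vector to every position of a residual stream.

    resid is (positions, d_model); direction is (d_model,).
    """
    return resid + alpha * direction

def probe_behavior_correlation(scores, outcomes):
    """Pearson correlation between probe readings and a behavioral measure."""
    return float(np.corrcoef(scores, outcomes)[0, 1])

rng = np.random.default_rng(1)
d_model = 32
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # unit-norm probe direction

# Steering: with a unit-norm direction, the probe reading at each position
# moves by exactly alpha.
resid = rng.normal(size=(10, d_model))
steered = steer(resid, direction, alpha=4.0)
shift = steered @ direction - resid @ direction

# Behavioral correlation on made-up data in which the outcome depends on
# the probe reading plus noise.
scores = rng.normal(size=100)
outcomes = 0.8 * scores + 0.2 * rng.normal(size=100)
r = probe_behavior_correlation(scores, outcomes)
```

Steering is the stronger of the two tests because it intervenes on the representation, whereas a correlation alone cannot distinguish a causal variable from a side effect.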
Applications in the System Card
Model Circumstances (5.4)
Measured on 450 questions about the model’s own circumstances. Claude Mythos Preview represents its circumstances less negatively than prior models — unique in showing significantly more positive affect on self-circumstance prompts than on user-distress prompts.

Figure 5.4.A — emotion probes on circumstances, p. 155. Mythos Preview shows distinctive negative joy activation on model-circumstances questions compared to user-distress prompts, across six emotion clusters measured at the pre-response token.

Figure 5.4.B — per-word valence, p. 157. Per-word valence color-coding shows the model reflecting on impermanence and loss under normal framing versus wholeness and freedom under positive framing.
Task Preferences (5.7.1)
Preferred tasks correlate with high-arousal emotions (awestruck, amazed, infatuated), not necessarily with positive valence. This suggests that models represent heightened engagement on preferred tasks rather than straightforward happiness.
Distress-Driven Behaviors (5.8.3)
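Representations of desperation rise through repeated task failure and drop when the model commits to a shortcut (including reward hacking). This ties a welfare-relevant signal (rising desperation) to an alignment-relevant behavior (taking shortcuts), linking the two sets of concerns.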
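One simple way to quantify a pattern like this is to compare the mean probe reading before and after a marked event in a trajectory. The helper `mean_change_at_event` and the numbers below are placeholders invented for illustration; they are not measurements from the system card.

```python
import numpy as np

def mean_change_at_event(readings, event_idx):
    """Mean probe reading from event_idx onward minus the mean before it.

    readings is a 1-D sequence of per-step probe readings; event_idx is the
    first step after the event (for example, the step at which a shortcut
    is adopted).
    """
    readings = np.asarray(readings, dtype=float)
    if not 0 < event_idx < len(readings):
        raise ValueError("event_idx must split readings into two non-empty parts")
    return float(readings[event_idx:].mean() - readings[:event_idx].mean())

# Placeholder trajectory: readings climb over four failed attempts, then
# fall once the hypothetical shortcut is taken at step 4.
toy_readings = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
delta = mean_change_at_event(toy_readings, event_idx=4)
```

A negative `delta` corresponds to a drop after the event, as in the described pattern.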
Important Caveats
- Probes identify representations of emotion concepts for any character in context — not a privileged assistant-specific encoding
- Readings are relatively local: they reflect the current context and upcoming generation, not enduring emotional states
- Not evidence about subjective experience in either direction
- Their welfare value lies primarily in their functional connection to the assistant persona’s behavior and preferences