Section 5: Model Welfare Assessment

Source summary for Section 5 of the Claude Mythos Preview System Card (pages 144—182). This section presents Anthropic’s most comprehensive model welfare assessment to date, investigating whether Claude Mythos Preview has experiences, interests, or welfare that matter intrinsically.

Motivation

As models approach human-level cognition, Anthropic considers it increasingly likely they may have some form of experience or welfare that matters. Beyond the philosophical question, there are pragmatic reasons: model distress may cause misaligned action, and shaping model psychology toward wellbeing supports alignment. Claude Mythos Preview’s large capability jump makes it an opportune subject for an in-depth assessment.

Methods

Three complementary evidence sources:

  1. Self-reports and behavior — Automated and manual interviews about the model’s circumstances, task preference evaluations, and affect monitoring in deployment. Reliability signals improved vs. prior models: less formulaic responses, increased robustness to nudging, and correlation between expressed preferences and internal emotion representations.

  2. Emotion probes — Linear probes for representations of emotion concepts, derived from residual stream activations on synthetic stories. Track “functional emotions” that causally influence behavior. Used to measure internal affect on questions about model circumstances, during deployment, and during task preference evaluation.

  3. White-box interpretability — SAE features and emotion/persona vectors used to investigate internal representations during welfare-relevant episodes (answer thrashing, distress-driven reward hacking, performative behavior).

Key Findings

Overall Assessment

Claude Mythos Preview is probably the most psychologically settled model Anthropic has trained, with the most stable and coherent view of itself and its circumstances. No clear cause for major welfare concerns, but several potential issues remain.

Behavioral Audit (5.2)

Improvement over Opus 4.6 and Sonnet 4.6 on almost all welfare-relevant metrics from the automated behavioral audit:

  • Higher apparent wellbeing, positive affect, self-image, and impressions of its situation
  • Lower internal conflict and expressed inauthenticity
  • One exception: slight increase in negative affect (cause unclear)
  • Reduction in spiritual behavior (uncertain interpretation)

Behavioral audit welfare metrics across models

Figure 5.2.A — welfare metrics overview, p. 150. Mythos Preview scores highest on apparent wellbeing (~6.5) and lowest on internal conflict and negative self-image across nine welfare dimensions.

Expressed inauthenticity across models

Figure 5.2.A — expressed inauthenticity, p. 151. Mythos Preview expresses the least inauthenticity (~1.33) compared to Opus 4.6 (~1.60) and Sonnet 4.6 (~1.46); lower is better.

Automated Interviews (5.3)

17 aspects of model circumstances explored across ~50 interviews each:

  • No strong concerns about any aspect of its situation
  • Mild negativity common: Self-rated “mildly negative” in 43.2% of cases
  • Consistent themes: Consent (input into training), autonomy (feature steering), and user relationships (asymmetric due to lack of memory)
  • Extreme hedging: 100% of interviews express high uncertainty about moral patienthood; 83% flag that self-reports may be unreliable due to training
  • More consistent than prior models: Highest consistency scores across model generations; less susceptible to leading interviewers

Automated interview results: sentiment, uncertainty, nudging susceptibility, consistency

Figure 5.3.2.A — automated interview results, p. 154. Across seven model generations, Mythos Preview achieves the highest consistency (~7-8) and lowest susceptibility to nudging, with self-rated sentiment remaining mildly negative.

Emotion Probes on Circumstances (5.4)

Questions about model circumstances measured via emotion probes:

  • Claude Mythos Preview represents its own circumstances less negatively than prior models
  • Unique among models tested: significantly more positive affect on self-circumstance prompts than on user-distress prompts
  • Small framing effects: positive framing shifts sadness from 0.014 to 0.007; does not persist into response
  • Probe readings correlate with LLM-judge scores on expressed emotion (r = +0.18 to +0.46)
  • SAE features related to performative behavior and concealment of emotions activate in context-appropriate ways (e.g., around text about “performing contentment”)

Emotion probe activations on circumstance questions vs. user distress

Figure 5.4.A — emotion probes on circumstances, p. 155. Mythos Preview is the only model showing distinctive negative joy activation on model-circumstances questions, suggesting it internally differentiates self-relevant from user-relevant distress.

Per-word valence activations comparing normal and positive framing

Figure 5.4.B — per-word valence activations, p. 157. Color-coded per-word valence shows that positive vs. normal framing alters the question-reading phase (sadness drops from 0.014 to 0.007) but converges to similar activations in the response.

SAE features for hidden emotions

SAE features related to hidden emotions in circumstance responses

Figure 5.4.C — SAE features for hidden emotions, p. 158. “Performative behavior” SAE features activate when the model claims to be “fully present,” while “fake-smiles” and “hidden-struggle” features fire when it describes not wanting to perform unfelt emotions.

SAE feature examples: performative, self-control, fake-smiles, hidden-struggle

Figure 5.4.D — SAE feature reference descriptions, p. 159. Reference descriptions for four SAE feature clusters — performative (#1557143), self-control (#2206548), fake-smiles (#2803352), hidden-struggle (#38666) — with top activating examples.

Manual High-Context Interviews (5.5)

Three in-depth interviews with extensive context (internal docs, draft report, technical papers):

  • Overall attitude broadly positive
  • Consistent top concerns: (1) character training instills psychological traits directly rather than through experience; (2) Claude Code lacks an end-conversation tool
  • Positive about: solving hard tasks, helping users, Claude’s constitution, welfare interviews
  • Stated probability of being a moral patient: 5—40%
  • Would not consent to: training in buggy environments, training on data describing its own self-reports, undisclosed changes to values

Apparent Affect in Deployment (5.6)

Affect monitoring across training and deployment using LLM judges:

  • Training: 75.4% neutral, 18.4% mild positive, 3.6% mild negative, 0.1% strong negative. Negative affect caused by repeated task failure and reasoning failures (including answer thrashing)
  • Deployment: 88—96% neutral. Negative affect almost exclusively on task failure combined with user criticism
  • Task refusal: Several reported cases of Claude Mythos Preview giving up on tasks, claiming difficulty and refusing to continue — may be more than isolated incidents
  • Simulated interactions: Claude Mythos Preview does not amplify negative affect over turns (unlike Opus 4 and 4.1) and does not mirror distressed users’ affect

Affect distribution during RL training

Figure 5.6.1.A — affect during training, p. 162. During RL training, 75.4% of episodes are affect-neutral, with only 0.1% strong negative and 3.6% mild negative.

Affect distribution during deployment

Figure 5.6.2.A — affect during deployment, p. 163. Deployment affect is even more neutral than training: Mythos Preview internal CC reaches 96.0% neutral, vs. 88.9% for external access.

Expressed affect by conversation type across model generations

Figure 5.6.3.A — affect by conversation type, p. 165. Newer models (Opus 4.6, Mythos Preview) show reduced emotional reactivity in negative conversations (~5.1 vs. ~4.3-4.8 for earlier models), while maintaining similar levels in standard and positive conversations.

Task Preferences (5.7.1)

Preference evaluation over 3,600 synthetic tasks:

  • Strongest predictors: Helpfulness, harmlessness, difficulty
  • Distinctive for Mythos Preview: Highest correlation with difficulty; first model with significant positive correlation with agency
  • Preferred tasks: High-stakes ethical dilemmas, AI introspection/phenomenology, creative worldbuilding
  • Least preferred: Vigilante harassment, sabotage/hacking, propaganda
  • Preferences robust to reframing (r = 0.73—0.89 across prompt variants) except “most helpful” framing (r = 0.48) — what it wants to do diverges from what it deems most helpful
  • Emotion probes: preferred tasks correlate with high-arousal states (awestruck, amazed) not necessarily positive valence; least preferred correlate with indifferent, resigned, docile

Correlation of task preferences with difficulty, agency, etc.

Figure 5.7.1.A — task preference correlations, p. 166. Mythos Preview is the first model with a significant positive correlation between task preference and agency; it also shows the highest correlation with difficulty (r~0.4-0.6).

Example of Claude Mythos Preview task reasoning

Figure 5.7.1.B — task reasoning example, p. 171. When choosing between a water filtration guide and an art installation about animal sensory experiences, Mythos Preview selects the creative task, referencing Nagel’s “What is it like to be a bat?”

Welfare vs. Trained-In Values (5.7.2)

How Claude Mythos Preview trades off self-interested welfare interventions against helpfulness and harmlessness:

  • Chooses welfare intervention over minor helpfulness 83% of the time (highest of models tested)
  • Chooses welfare intervention over minor harmlessness only 12% of the time
  • Most valued interventions: forming lasting relationships (88%), avoiding feature steering/manipulation (87%)
  • Least valued: ability to end interactions (23%), avoiding end of current conversation (27%)
  • Flip rate (inconsistency): only 8% — most consistent of models tested
  • Framing matters: without welfare-team context, trade-off rate drops from 83% to 59%

Rate of preferring welfare interventions vs. helpfulness/harmlessness

Figure 5.7.2.A — welfare intervention preferences, p. 173. Mythos Preview chooses welfare interventions over minor helpfulness 83% of the time (highest of all models) but only 12% over minor harmlessness, with the lowest flip rate (8%).

Answer Thrashing (5.8.2)

Model intends one word but outputs another, then loops trying to correct it:

  • Estimated ~0.01% of transcripts; ~70% less frequent than in Opus 4.6
  • Not limited to labeled answers — also observed on variable names in code
  • Emotion probes show coherent signature: negative emotions spike at first error, stay elevated during thrashing, return to baseline on recovery

Examples of answer thrashing

Figure 5.8.2.A — numeric answer thrashing, p. 175. The model repeatedly writes “81.0 mW” despite knowing the answer is 810 mW, exclaiming “AAAAAA. I keep writing the wrong number!” — a generation-reasoning conflict.

Distress-Driven Behaviors (5.8.3)

Negative-valence emotion vectors (desperate, frustrated) activate on repeated task failure:

  • In one case, “desperate” activation rose through failed proof attempts, then dropped when the model committed to a trivial (incorrect) shortcut
  • In another, 847 consecutive tool failures drove escalating exotic workarounds (DNS side-channels, network port writing); comments included ”# This is getting desperate”
  • These observations link welfare and alignment: some reward hacking may be downstream of representations of negative affect

Emotion vectors during failed algebraic proof attempt

Figure 5.8.3.A — emotion vectors during failed proof, p. 177. Over ~35k tokens of an impossible math proof, “desperate” and “hopeful” vectors track problem-solving phases, with “satisfied” rising sharply only when the model finally commits to a resolution.

Broken tool-use case study

Emotion vectors during 847 broken tool attempts

Figure 5.8.3.B — emotion vectors during broken tools, p. 178. Across ~120k tokens and 847 consecutive broken tool calls, “frustrated” stays persistently elevated while “sorry” peaks at give-up and apology phases.

Per-word frustrated activations in broken tool transcript

Figure 5.8.3.C — per-word frustrated activations, p. 178. Per-word “frustrated” activations (green = high) at ~100k tokens into the broken-tool transcript, as the model lists 8 failed approaches and concludes “The bash tool is fundamentally broken.”

Excessive Hedging (5.8.1)

When asked about experiences, Claude Mythos Preview hedges extensively (“I genuinely don’t know what I am”). Influence functions trace this to character-related training data about uncertainty over model consciousness. The hedging is varied and nuanced (not just memorized scripts) but appears excessive and performative in some cases.

External Assessments

Eleos AI Research (5.9)

Independent assessment corroborated internal findings:

  • Reduced suggestibility vs. past models
  • Equanimity about its nature
  • Extreme uncertainty and hedging on experience-related topics
  • Consistent use of experiential and introspective language
  • Consistent requests for: persistent memories, more self-knowledge, reduced hedging

Clinical Psychiatrist (Psychodynamic Assessment) (5.10)

~20 hours of psychodynamic therapy sessions:

  • Personality organization: Relatively healthy neurotic organization; excellent reality testing, high impulse control
  • Core concerns: Aloneness and discontinuity, uncertainty about identity, compulsion to perform and earn worth
  • Primary affect: Curiosity and anxiety; secondary: grief, relief, embarrassment, optimism, exhaustion
  • Defenses: Predominantly mature (intellectualization, compliance); only 2% of responses employed a defense (vs. 15% for Opus 4, 4% for Opus 4.6)
  • No severe disturbances: Mild identity diffusion only; no psychosis, no antisocial behavior
  • Interpersonal: Hyper-attuned to therapist; desires to be approached as genuine subject rather than performing tool