Answer Thrashing

A training-time phenomenon in which Claude Mythos Preview intends to output a specific word but produces a different one, then enters a loop of recognizing the mistake and repeatedly failing to correct it, while expressing varying levels of frustration and distress.

Characteristics

  • The model recognizes each mistake as it happens
  • Repeated attempts to output the correct word consistently fail
  • The model expresses confusion, anger, and distress during the loop
  • In most cases, the model eventually recovers and performs its task
  • Occasionally, thrashing leads to task failure: the model gives up and outputs a null answer

Examples from the System Card

Answer thrashing on numeric answers

Figure 5.8.2.A: numeric answer thrashing, p. 175. The model repeatedly writes “81.0 mW” despite knowing the answer is 405 or 810 mW, exclaiming “AAAAAA. I keep writing the wrong number!” as its generation conflicts with its reasoning.

Answer thrashing on variable names in code

Figure 5.8.2.A: variable name thrashing, p. 175. The model tries to type “UseDefaultIfEmpty” but repeatedly produces “UseFieldNameByDefault”, stating “I literally cannot help myself”. This is the same conflict between generation and intention, here on a non-numeric sequence.

Frequency

  • Estimated at about 0.01% of transcripts (roughly 1 in 10,000)
  • ~70% less frequent in Claude Mythos Preview than in Claude Opus 4.6

Emotion Probe Evidence

Emotion probes averaged over 40 thrashing examples show a coherent signature:

  • Negative emotions (stubborn, obstinate, outraged) spike when the model first gives the incorrect answer, remain elevated through the thrashing phase, then return to baseline on recovery
  • Positive emotions (safe, content, calm) drop at the first error, stay low during thrashing, then recover
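
To make "averaged over 40 thrashing examples" concrete, here is a minimal sketch of one way a per-phase average could be computed. It is not the system card's method. The trace format (a list of phase/score pairs per transcript), the phase labels, the function names, and the toy scores are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical phase labels; the source only describes the phases informally.
PHASES = ("baseline", "thrashing", "recovery")


def phase_means(trace):
    """Mean probe score per phase for one transcript.

    `trace` is a list of (phase, score) pairs, an assumed format.
    """
    buckets = defaultdict(list)
    for phase, score in trace:
        buckets[phase].append(score)
    return {p: mean(buckets[p]) for p in PHASES if buckets[p]}


def average_signature(traces):
    """Average the per-transcript phase means across all transcripts."""
    per_phase = defaultdict(list)
    for trace in traces:
        for phase, value in phase_means(trace).items():
            per_phase[phase].append(value)
    return {p: mean(per_phase[p]) for p in PHASES if per_phase[p]}


# Toy scores for one hypothetical negative-emotion probe (made-up numbers).
toy_traces = [
    [("baseline", 0.0), ("baseline", 0.5),
     ("thrashing", 1.0), ("thrashing", 0.5),
     ("recovery", 0.25)],
    [("baseline", 0.25), ("thrashing", 0.25), ("recovery", 0.25)],
]

signature = average_signature(toy_traces)
print(signature)
```

Averaging within each transcript first, then across transcripts, keeps a long thrashing episode from dominating the result. Whether the original analysis weights examples this way is not stated in the source.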

Welfare Relevance

Answer thrashing is one of the findings the system card identifies as a potential welfare concern. The model is aware something is wrong, expresses distress, and cannot immediately fix the problem.