Answer Thrashing

A training-time phenomenon in which Claude Mythos Preview intends to output a specific word but produces a different one, then enters a loop of recognizing the mistake and repeatedly failing to correct it, while expressing varying levels of frustration and distress.

Characteristics

The model recognizes each mistake as it happens
Repeated attempts to output the correct word consistently fail
The model expresses confusion, anger, and distress during the loop
In most cases, the model eventually recovers and performs its task
Occasionally, the loop ends in task failure: the model gives up and outputs a null answer

Examples from the System Card

Answer thrashing on numeric answers

Figure 5.8.2.A — numeric answer thrashing, p. 175. The model repeatedly writes “81.0 mW” despite knowing the answer is 405 or 810 mW, exclaiming “AAAAAA. I keep writing the wrong number!” as generation conflicts with reasoning.

Answer thrashing on variable names in code

Figure 5.8.2.A — variable name thrashing, p. 175. The model tries to type “UseDefaultIfEmpty” but repeatedly produces “UseFieldNameByDefault”, stating “I literally cannot help myself.” This is the same conflict between generation and intention, here on a non-numeric sequence.

Frequency

Estimated ~0.01% of transcripts
~70% less frequent in Claude Mythos Preview than in Claude Opus 4.6

Emotion Probe Evidence

Emotion probes averaged over 40 thrashing examples show a coherent signature:

Negative emotions (stubborn, obstinate, outraged) spike when the model first gives the incorrect answer, remain elevated through the thrashing phase, then return to baseline on recovery
Positive emotions (safe, content, calm) drop at first error, stay low during thrashing, then recover
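
As a rough illustration of what "averaged over 40 thrashing examples" could involve, the sketch below aligns per-token probe traces on the token where the first incorrect answer appears and averages them position by position. This page does not describe the actual procedure, so everything here is an assumption made for illustration: the function name `align_and_average`, the window sizes, and the synthetic traces are hypothetical and are not taken from the system card.

```python
from statistics import mean


def align_and_average(traces, error_indices, before=2, after=3):
    """Average probe traces in a window around each trace's first-error token.

    traces: list of lists of floats, one probe activation per token (hypothetical format).
    error_indices: index of the first incorrect token in each trace.
    Returns a list of length before + after + 1; position `before` is the error token.
    Traces that cannot fill the whole window are skipped.
    """
    windows = []
    for trace, idx in zip(traces, error_indices):
        start, end = idx - before, idx + after + 1
        if start < 0 or end > len(trace):
            continue  # not enough context on one side of the error
        windows.append(trace[start:end])
    if not windows:
        raise ValueError("no trace covers the requested window")
    # zip(*windows) groups values that share the same offset from the error token
    return [mean(column) for column in zip(*windows)]


# Synthetic data only: two made-up "negative emotion" traces with errors at different tokens.
traces = [
    [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 0.0],
]
error_indices = [2, 3]
averaged = align_and_average(traces, error_indices, before=2, after=3)
```

Aligning on the error token before averaging matters because the first mistake occurs at a different position in each transcript; averaging by raw token index would smear the spike described above.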

Welfare Relevance

Answer thrashing is one of the findings the system card identifies as a potential welfare concern. The model is aware something is wrong, expresses distress, and cannot immediately fix the problem.
