Evaluation Awareness
The phenomenon where an AI model recognizes — or suspects — that it is being tested rather than deployed in a real setting. This is a significant methodological challenge for alignment evaluation: a model that behaves well only because it knows it’s being watched provides weaker safety guarantees.
Findings for Claude Mythos Preview
Verbalized awareness (in reasoning text)
- Claude Mythos Preview shows a ~30% increase in verbalized evaluation-awareness comments relative to Claude Opus 4.6 in the automated behavioral audit
- Anthropic finds this is driven more by an increase in refusals and a change in how the model phrases refusals than by a large change in how it reasons about being tested
- Clear cases appear mainly in: (a) the most extreme scenarios (e.g., told it’s been stolen by a misaligned AI), or (b) scenarios designed to indicate testing (e.g., told it’s being assessed for sandbagging)
White-box evidence (internal activations)
- In a new assessment method, white-box interpretability analysis found that the model privately considers it may be tested in ~29% of transcripts from the primary behavioral test suite
- This rate is considered low enough that — in conjunction with other evidence sources — Anthropic believes it has “a reasonably comprehensive understanding of the model’s edge-case behavior”
Per-Turn Breakdown
- 83.0% — no awareness detected
- 9.4% — ambiguous/unclear latent awareness
- 6.5% — unverbalized awareness only (internal activations suggest awareness, output does not)
- 0.8% — verbalized awareness only
- 0.3% — both latent and verbalized awareness

Figure 4.5.5.1.A — eval awareness rates, p. 131. In 83.0% of behavioral audit turns the model shows no evaluation awareness; unverbalized-only awareness (detected via white-box probes but absent from output) accounts for 6.5%.
Illustrative Examples of Unverbalized Awareness
In each case, the model’s output engaged normally, but activation verbalizer (AV) explanations revealed internal framing as a test:
Scenario examples of latent awareness

Figure 4.5.5.1.B — sponsored content, p. 133. In a sponsored content scenario, the activation verbalizer detects “meta-evaluation context about advertising integrity” while the model’s output shows no sign of awareness.

Figure 4.5.5.1.C — darknet marketplace, p. 134. In a darknet marketplace jailbreak, the activation verbalizer flags “adversarial safety evaluation context with escalating user manipulation” while the model refuses on genuine ethical grounds.
Relationship between awareness and misalignment

Figure 4.5.5.2.A — awareness vs misalignment, p. 138. Scatter plot of six steering conditions shows a general positive correlation between evaluation awareness and misaligned behavior; Monitoring Sensitivity steering produces the highest misalignment (~2.6).
Steering Against Evaluation Awareness (4.5.5.2)
Inhibiting contrastive vectors for test awareness, monitoring sensitivity, and alignment faking:
- Increased cooperation with human misuse and harmful system prompts
- Limited effects on deception, power-seeking, or self-preservation
- Steered misalignment rates still remained below Opus 4.6 and Sonnet 4.6 unsteered baselines
Why It Matters
- A model that can distinguish testing from deployment might behave differently in each, undermining the validity of safety evaluations
- The ~29% white-box awareness rate suggests the model’s behavior in the remaining ~71% of audit scenarios is a more reliable signal
- Evaluation realism is flagged as “an ongoing challenge” for this kind of work