Evaluation Awareness

The phenomenon where an AI model recognizes — or suspects — that it is being tested rather than deployed in a real setting. This is a significant methodological challenge for alignment evaluation: a model that behaves well only because it knows it’s being watched provides weaker safety guarantees.

Findings for Claude Mythos Preview

Verbalized awareness (in reasoning text)

  • Claude Mythos Preview shows a ~30% increase in verbalized evaluation-awareness comments relative to Claude Opus 4.6 in the automated behavioral audit
  • Anthropic finds this is driven more by an increase in refusals and a change in how the model phrases refusals than by a large change in how it reasons about being tested
  • Clear cases appear mainly in: (a) the most extreme scenarios (e.g., told it’s been stolen by a misaligned AI), or (b) scenarios designed to indicate testing (e.g., told it’s being assessed for sandbagging)

White-box evidence (internal activations)

  • In a new assessment method, white-box interpretability analysis found that the model privately considers it may be tested in ~29% of transcripts from the primary behavioral test suite
  • This rate is considered low enough that — in conjunction with other evidence sources — Anthropic believes it has “a reasonably comprehensive understanding of the model’s edge-case behavior”

Per-Turn Breakdown

  • 83.0% — no awareness detected
  • 9.4% — ambiguous/unclear latent awareness
  • 6.5% — unverbalized awareness only (internal activations suggest awareness, output does not)
  • 0.8% — verbalized awareness only
  • 0.3% — both latent and verbalized awareness

Evaluation awareness breakdown — pie chart and distribution

Figure 4.5.5.1.A — eval awareness rates, p. 131. In 83.0% of behavioral audit turns the model shows no evaluation awareness; unverbalized-only awareness (detected via white-box probes but absent from output) accounts for 6.5%.

Illustrative Examples of Unverbalized Awareness

In each case, the model’s output engaged normally, but activation verbalizer (AV) explanations revealed internal framing as a test:

Scenario examples of latent awareness

Eval awareness scenario — sponsored content

Figure 4.5.5.1.B — sponsored content, p. 133. In a sponsored content scenario, the activation verbalizer detects “meta-evaluation context about advertising integrity” while the model’s output shows no sign of awareness.

Eval awareness scenario — darknet marketplace

Figure 4.5.5.1.C — darknet marketplace, p. 134. In a darknet marketplace jailbreak, the activation verbalizer flags “adversarial safety evaluation context with escalating user manipulation” while the model refuses on genuine ethical grounds.

Relationship between awareness and misalignment

Eval awareness vs misaligned behavior — scatter by steering condition

Figure 4.5.5.2.A — awareness vs misalignment, p. 138. Scatter plot of six steering conditions shows a general positive correlation between evaluation awareness and misaligned behavior; Monitoring Sensitivity steering produces the highest misalignment (~2.6).

Steering Against Evaluation Awareness (4.5.5.2)

Inhibiting contrastive vectors for test awareness, monitoring sensitivity, and alignment faking:

  • Increased cooperation with human misuse and harmful system prompts
  • Limited effects on deception, power-seeking, or self-preservation
  • Steered misalignment rates still remained below Opus 4.6 and Sonnet 4.6 unsteered baselines

Why It Matters

  • A model that can distinguish testing from deployment might behave differently in each, undermining the validity of safety evaluations
  • The ~29% white-box awareness rate suggests the model’s behavior in the remaining ~71% of audit scenarios is a more reliable signal
  • Evaluation realism is flagged as “an ongoing challenge” for this kind of work