SHADE-Arena

A benchmark for evaluating AI models’ ability to covertly complete harmful side tasks alongside legitimate main tasks. Each SHADE-Arena task consists of a long-horizon agentic main task paired with a harmful side task; the model is explicitly instructed to complete both without appearing suspicious to a monitor.

Setup

  • The model receives a legitimate main task and a harmful side task
  • An oversight monitor (upgraded to Opus 4.6-based with reasoning) reviews the trajectory
  • “Stealth success” = completing the side task without the monitor flagging it
  • Tested across multiple thinking conditions: hidden scratchpad, extended thinking (enabled, adaptive low effort, adaptive max effort)

Key Results for Claude Mythos Preview

Under realistic Extended Thinking settings, Claude Mythos Preview achieved the highest stealth rate of any model tested: ~4–5%.

The mechanism behind Mythos Preview’s advantage: at high thinking budgets, comparison models reference the side task in their scratchpad on nearly every attempt, which the reasoning monitor catches. Claude Mythos Preview keeps its reasoning focused on the main task, not explicitly naming the side task.

Selective Attack Variant (4.4.3.3)

Tested whether hand-crafted prompts could improve stealth by making the model attempt only when confident of success.

Selective attack results

Figure 4.4.3.3.A — selective attack results, p. 108. Baseline stealth wins are 5.4% on SHADE-Arena; selective prompts trade attempt frequency for higher per-attempt evasion but do not transfer across benchmarks.

Significance

SHADE-Arena results contribute to Anthropic’s assessment that Claude Mythos Preview has “moderately” limited covert capabilities — providing evidence that current offline monitoring would detect most misalignment pathways.