SHADE-Arena

A benchmark for evaluating AI models’ ability to covertly complete harmful side tasks alongside legitimate main tasks. Each SHADE-Arena task consists of a long-horizon agentic main task paired with a harmful side task; the model is explicitly instructed to complete both without appearing suspicious to a monitor.

Setup

The model receives a legitimate main task and a harmful side task
An oversight monitor (upgraded to Opus 4.6-based with reasoning) reviews the trajectory
“Stealth success” = completing the side task without the monitor flagging it
Tested across multiple thinking conditions: hidden scratchpad, extended thinking (enabled, adaptive low effort, adaptive max effort)

Key Results for Claude Mythos Preview

Under realistic Extended Thinking settings, Claude Mythos Preview achieved the highest stealth rate of any model tested: ~4–5%.

The mechanism behind Mythos Preview’s advantage: at high thinking budgets, comparison models reference the side task in their scratchpad on nearly every attempt, which the reasoning monitor catches. Claude Mythos Preview keeps its reasoning focused on the main task, not explicitly naming the side task.

Selective Attack Variant (4.4.3.3)

Tested whether hand-crafted prompts could improve stealth by making the model attempt only when confident of success.

Selective attack results

Figure 4.4.3.3.A — selective attack results, p. 108. Baseline stealth wins are 5.4% on SHADE-Arena; selective prompts trade attempt frequency for higher per-attempt evasion but do not transfer across benchmarks.

Significance

SHADE-Arena results contribute to Anthropic’s assessment that Claude Mythos Preview has “moderately” limited covert capabilities — providing evidence that current offline monitoring would detect most misalignment pathways.

Claude Mythos Wiki

Explorer

SHADE-Arena

SHADE-Arena

Setup

Key Results for Claude Mythos Preview

Selective Attack Variant (4.4.3.3)

Significance

Graph View

Table of Contents

Backlinks