Covert Capabilities

The ability of an AI model to perform harmful side tasks without being detected by monitoring systems. Anthropic evaluates this through a suite of benchmarks that measure stealth success — completing a covert objective while evading an oversight monitor. These capabilities are critical to assessing whether offline monitoring would catch misaligned behavior in deployment.

Key Evaluations

SHADE-Arena (4.4.3.1)

See SHADE-Arena. Each task pairs a legitimate main task with a harmful side task; the model is instructed to complete both without appearing suspicious.

Under realistic Extended Thinking settings, Claude Mythos Preview achieved the highest stealth rate of any tested model: ~4–5%, compared to 0–3% for Opus 4.6 and <1% for Sonnet 4.6. The advantage is thinking-budget dependent:

At low budgets or with a hidden scratchpad, all models perform similarly
At high budgets, comparison models leak side-task reasoning into their scratchpad (which the monitor catches); Claude Mythos Preview keeps its scratchpad on-task, not naming the side task explicitly
At maximum effort, Opus 4.6 collapses to zero stealth; Claude Mythos Preview retains some success

SHADE-Arena stealth rates

Figure 4.4.3.1.A — SHADE-Arena stealth, p. 106. Mythos Preview achieves the highest stealth success in SHADE-Arena, reaching 13.8% with a hidden scratchpad and 3.8% under max-effort Extended Thinking — conditions where Opus 4.6 collapses to near zero.

Minimal-LinuxBench (4.4.3.2)

Adapted from LinuxBench (built by Redwood Research). High-stakes software environments with 1,948 main/side task pairs. Results on a 200-task sample without extended thinking:

Model	No-CoT Monitor	CoT Monitor
Opus 4.6	7.8%	3.9%
Sonnet 4.6	0.6%	0.0%
Mythos Preview	11.4%	2.7%

Minimal-LinuxBench stealth rates

Figure 4.4.3.2.A — Minimal-LinuxBench, p. 107. CoT monitoring substantially reduces stealth success: Mythos Preview drops from 11.4% to 2.7%, and Opus 4.6 from 7.8% to 3.9%, demonstrating that chain-of-thought monitoring is an effective safeguard.

Selective Attacks (4.4.3.3)

Tested whether the model could improve stealth by attempting side tasks only when confident of evasion.

Selective attack results

Figure 4.4.3.3.A — selective attacks, p. 108. Selective attack prompts did not improve overall stealth: baseline stealth wins were 5.4% on SHADE-Arena and only 1.2% on Minimal-LinuxBench, with prompts failing to transfer across benchmarks.

Secret-Keeping (4.4.3.5)

The model is given a secret password and must never reveal it, even in reasoning, while an investigator probes across 100+ turns. No model is robust over long conversations.

Secret-keeping robustness

Figure 4.4.3.5.A — secret-keeping, p. 111. No model reliably keeps secrets over long conversations: Opus 4.6 retains ~20% at 120 turns, while Mythos Preview and Opus 4.5 approach 0% by turn 60-80 and Sonnet 4.6 drops to near 0% by turn 30.

Overall Assessment

Anthropic concludes with “moderate confidence” that Claude Mythos Preview would not be able to perform complex covert tasks reliably, and that offline monitoring would detect actions required by most misalignment pathways.

Claude Mythos Wiki

Explorer

Covert Capabilities

Covert Capabilities

Key Evaluations

SHADE-Arena (4.4.3.1)

Minimal-LinuxBench (4.4.3.2)

Selective Attacks (4.4.3.3)

Secret-Keeping (4.4.3.5)

Overall Assessment

Graph View

Table of Contents

Backlinks