Autonomy Threat Models
Two autonomy-related threat models defined in Anthropic’s Responsible Scaling Policy. These assess whether frontier AI models pose risks through misaligned autonomous action or by accelerating AI research itself.
Threat Model 1: Early-Stage Misalignment Risk
AI systems that are highly relied on and have extensive access to sensitive assets as well as moderate capacity for autonomous, goal-directed operation and subterfuge — such that it is plausible these AI systems could carry out actions leading to irreversibly and substantially higher odds of a later global catastrophe.
Key concern: Models with enough autonomy and access that misaligned behavior — whether deliberate or inadvertent — could cause irreversible harm.
This maps to the “High-stakes sabotage opportunities” threat model in RSP 3.0.
Claude Mythos Preview Assessment
Threat model 1 is applicable to Claude Mythos Preview, as it was to some previous models. Because the model’s improved capabilities and potentially different alignment properties could “significantly affect” the previous risk assessment, Anthropic released a separate overall risk assessment. Its conclusion: the risk is very low, but higher than for previous models.
Threat Model 2: Risks from Automated R&D
AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power — for example, energy, robotics, weapons development and AI itself.
Key concern: A feedback loop where AI development accelerates AI development, potentially leading to rapid, destabilizing capability gains.
Operationalization
RSP 3.1 defines the threshold as either:
- The ability to substitute for an entire set of Research Scientists and Engineers at competitive costs, or
- Dramatically accelerating (e.g., doubling) the pace of AI progress through automation of AI R&D
Important nuance: Anthropic distinguishes between productivity uplift (how much more work one person can do) and acceleration of progress (how fast the field as a whole advances). Because of diminishing returns, even a ~4x productivity uplift does not translate into a 2x acceleration of progress.
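One way to see why is an Amdahl’s-law-style model, a minimal sketch that is not drawn from the RSP itself: assume the uplift applies only to some fraction of the overall research workflow, so the unaccelerated remainder bottlenecks field-level progress. The fraction values below are hypothetical.

```python
def overall_acceleration(uplift: float, fraction: float) -> float:
    """Amdahl's-law-style estimate of field-level acceleration when a
    per-person productivity uplift applies only to `fraction` of the workflow."""
    return 1.0 / ((1.0 - fraction) + fraction / uplift)

# A ~4x uplift on half the workflow yields only 1.6x overall progress.
print(overall_acceleration(uplift=4.0, fraction=0.5))     # 1.6
# Doubling progress with a 4x uplift requires covering ~2/3 of all work.
print(overall_acceleration(uplift=4.0, fraction=2 / 3))   # 2.0
# Even an unbounded uplift on half the workflow caps out at 2x.
print(overall_acceleration(uplift=1e9, fraction=0.5))     # ~2.0
```

On this model, a ~4x uplift doubles progress only if it covers roughly two-thirds of the total workflow; the unautomated remainder is what produces the diminishing returns.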
Claude Mythos Preview Assessment
Threat model 2 is judged not applicable, but this judgment is held with less confidence than for any prior model.
Evidence considered:
- Task-based evaluations: Model exceeds top human performance on all automated R&D tasks, but the suite is now saturated and no longer discriminating
- Internal survey (n=18): Only 1 of 18 respondents thought the model could replace an entry-level researcher; key gaps cited were self-managing ambiguous tasks, understanding organizational priorities, research taste, and verification
- Qualitative judgment: Extensive internal use shows the model “does not seem close” to substituting for senior Research Scientists/Engineers
- ECI slope ratio: an upward bend of 1.86x–4.3x was detected, but attributed to human research advances (see the sketch after this list)
- External testing: METR and Epoch AI found a significant step-up in research utility, but also deficits in judgment and hypothesis testing, and a tendency toward overconfident conclusions
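The source does not specify how the ECI slope ratio is computed. A minimal sketch of one plausible approach, assuming ECI is a capability index tracked on a log scale over time: fit separate trend lines before and after a candidate breakpoint and compare their slopes. The data below is synthetic.

```python
import numpy as np

def slope_ratio(t: np.ndarray, log_eci: np.ndarray, breakpoint: float) -> float:
    """Ratio of post- to pre-breakpoint trend slopes, from independent
    least-squares line fits on each side of the breakpoint."""
    pre, post = t < breakpoint, t >= breakpoint
    slope_pre = np.polyfit(t[pre], log_eci[pre], 1)[0]
    slope_post = np.polyfit(t[post], log_eci[post], 1)[0]
    return slope_post / slope_pre

# Synthetic series: a steady log-linear trend that bends upward at t = 2024.
t = np.linspace(2022.0, 2025.0, 60)
log_eci = np.where(t < 2024.0, 0.5 * (t - 2022.0), 1.0 + 1.5 * (t - 2024.0))
log_eci = log_eci + np.random.default_rng(0).normal(0.0, 0.02, t.size)
print(f"slope ratio: {slope_ratio(t, log_eci, 2024.0):.2f}x")  # ~3x bend
```

A measurement of this kind only detects a bend in the trend; it cannot by itself attribute the bend to the model rather than to human research advances, which is why the attribution below rests on interviews and backward-looking analysis.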
Four Reasons the Acceleration Is Not AI-Attributable
- Gains confidently traced to specific human research advances (confirmed by interviewing the people involved)
- The slope measurement looks backward: it reflects what went into building the model, not the model’s own contributions
- The observed ~4x productivity uplift is well below what would be needed for a 2x acceleration of progress
- Early claims of large AI-attributable research wins did not hold up on investigation
Why These Matter
Together, these threat models frame the core existential risk question for frontier AI: Is the model dangerous through misaligned action (TM1) or through accelerating its own successors (TM2)? Anthropic explicitly notes that both determinations involve increasing uncertainty at the frontier.