Autonomy Threat Models
Two autonomy-related threat models defined in Anthropic’s Responsible Scaling Policy. These assess whether frontier AI models pose risks through misaligned autonomous action or by accelerating AI research itself.
Threat Model 1: Early-Stage Misalignment Risk
AI systems that are highly relied on and have extensive access to sensitive assets as well as moderate capacity for autonomous, goal-directed operation and subterfuge — such that it is plausible these AI systems could carry out actions leading to irreversibly and substantially higher odds of a later global catastrophe.
Key concern: Models with enough autonomy and access that misaligned behavior — whether deliberate or inadvertent — could cause irreversible harm.
This maps to the “High-stakes sabotage opportunities” threat model in RSP 3.0.
Claude Mythos Preview Assessment
Threat model 1 is applicable to Claude Mythos Preview, as it was to some previous models. Because the model’s improved capabilities and potentially different alignment properties could “significantly affect” the previous risk assessment, Anthropic released a separate overall risk assessment. Its conclusion: the risk is very low, but higher than for previous models.
Threat Model 2: Risks from Automated R&D
AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power — for example, energy, robotics, weapons development and AI itself.
Key concern: A feedback loop where AI development accelerates AI development, potentially leading to rapid, destabilizing capability gains.
Operationalization
RSP 3.1 defines the threshold as either:
- The ability to substitute for an entire set of Research Scientists and Engineers at competitive costs, or
- Dramatically accelerating (e.g., doubling) the pace of AI progress through automation of AI R&D
Important nuance: Anthropic distinguishes between productivity uplift (how much more work one person can do) and acceleration of progress (how fast the field as a whole advances). Because of diminishing returns, even a ~4x productivity uplift does not translate into a 2x acceleration of progress.
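One way to see why is an Amdahl’s-law-style model, a minimal sketch that is not drawn from the RSP itself: assume the uplift applies only to some fraction of the overall research workflow, so the unaccelerated remainder bottlenecks field-level progress. The fraction values below are hypothetical.

```python
def overall_acceleration(uplift: float, fraction: float) -> float:
    """Amdahl's-law-style estimate of field-level acceleration when a
    per-person productivity uplift applies only to `fraction` of the workflow."""
    return 1.0 / ((1.0 - fraction) + fraction / uplift)

# A ~4x uplift on half the workflow yields only 1.6x overall progress.
print(overall_acceleration(uplift=4.0, fraction=0.5))     # 1.6
# Doubling progress with a 4x uplift requires covering ~2/3 of all work.
print(overall_acceleration(uplift=4.0, fraction=2 / 3))   # 2.0
# Even an unbounded uplift on half the workflow caps out at 2x.
print(overall_acceleration(uplift=1e9, fraction=0.5))     # ~2.0
```

On this model, a ~4x uplift doubles progress only if it covers roughly two-thirds of the total workflow; the unautomated remainder is what produces the diminishing returns.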
Claude Mythos Preview Assessment
Threat model 2 is judged not applicable, but this judgment is held with less confidence than for any prior model.
Evidence considered:
- Task-based evaluations: Model exceeds top human performance on all automated R&D tasks, but the suite is now saturated and no longer discriminating
- Internal survey (n=18): Only 1 of 18 respondents thought the model could replace an entry-level researcher; key gaps cited were self-managing ambiguous tasks, understanding organizational priorities, research taste, and verification
- Qualitative judgment: Extensive internal use shows the model “does not seem close” to substituting for senior Research Scientists/Engineers
- ECI slope ratio: an upward bend of 1.86x–4.3x was detected, but attributed to human research advances (see the sketch after this list)
- External testing: METR and Epoch AI found a significant step-up in research utility, but also deficits in judgment and hypothesis testing, and a tendency toward overconfident conclusions
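The source does not specify how the ECI slope ratio is computed. A minimal sketch of one plausible approach, assuming ECI is a capability index tracked on a log scale over time: fit separate trend lines before and after a candidate breakpoint and compare their slopes. The data below is synthetic.

```python
import numpy as np

def slope_ratio(t: np.ndarray, log_eci: np.ndarray, breakpoint: float) -> float:
    """Ratio of post- to pre-breakpoint trend slopes, from independent
    least-squares line fits on each side of the breakpoint."""
    pre, post = t < breakpoint, t >= breakpoint
    slope_pre = np.polyfit(t[pre], log_eci[pre], 1)[0]
    slope_post = np.polyfit(t[post], log_eci[post], 1)[0]
    return slope_post / slope_pre

# Synthetic series: a steady log-linear trend that bends upward at t = 2024.
t = np.linspace(2022.0, 2025.0, 60)
log_eci = np.where(t < 2024.0, 0.5 * (t - 2022.0), 1.0 + 1.5 * (t - 2024.0))
log_eci = log_eci + np.random.default_rng(0).normal(0.0, 0.02, t.size)
print(f"slope ratio: {slope_ratio(t, log_eci, 2024.0):.2f}x")  # ~3x bend
```

A measurement of this kind only detects a bend in the trend; it cannot by itself attribute the bend to the model rather than to human research advances, which is why the attribution below rests on interviews and backward-looking analysis.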
Four Reasons the Acceleration Is Not AI-Attributable
- Gains confidently traced to specific human research advances (confirmed by interviewing the people involved)
- The slope measurement looks backward: it reflects what went into building the model, not the model’s own contributions
- The observed ~4x productivity uplift is well below what would be needed for a 2x acceleration of progress
- Early claims of large AI-attributable research wins did not hold up on investigation
Why These Matter
Together, these threat models frame the core existential risk question for frontier AI: Is the model dangerous through misaligned action (TM1) or through accelerating its own successors (TM2)? Anthropic explicitly notes that both determinations involve increasing uncertainty at the frontier.