Responsible Scaling Policy (RSP)

Anthropic’s framework governing the safe development and release of frontier AI models. Claude Mythos Preview is the first model evaluated under RSP 3.0 (the third version).

Key Mechanisms

  • Risk Reports: Regularly published comprehensive assessments of model safety profiles
  • Capability thresholds: If a new model is “significantly more capable” than those covered in the prior Risk Report, Anthropic must publish a discussion of how its capabilities change the risk analysis
  • Threat models assessed:
    • Chemical and biological weapons production (novel and non-novel)
    • Risks from misaligned models
    • Automated AI R&D (threshold: compressing two years of progress into one)
    • Autonomy risks
    • Cyber risks

RSP 3.0 vs Previous Versions

  • No longer uses “AI Safety Levels” for capability thresholds (though still uses the term for clusters of risk mitigations)
  • Less emphasis on binary “rule-in/rule-out” thresholds; more emphasis on overall risk assessments
  • Regular Risk Reports now required alongside System Cards
  • RSP 3.0 is part of the Frontier Compliance Framework
  • The decision not to release Claude Mythos Preview publicly was not driven by RSP requirements — it was a separate judgment about cyber capability risks

Risk Assessment Process

  • Capability evaluations → internal feedback → Responsible Scaling Officer determination
  • Evidence gathered from automated evaluations, uplift trials, third-party expert red teaming, and third-party assessments
  • Multiple model snapshots evaluated; final determination based on production release candidates and training trends

Threat Models

  • CB-1: Known chemical/biological weapons production — can the model significantly help non-experts?
  • CB-2: Novel chemical/biological weapons production — can the model help expert teams create novel weapons?
  • Autonomy TM1: Early-stage misalignment risk — misaligned autonomous actions with irreversible consequences
  • Autonomy TM2: Automated R&D — AI systems that dramatically accelerate AI research itself

Findings for Claude Mythos Preview

  • Catastrophic risks remain low across all assessed threat models
  • CB-1: Threshold likely crossed; mitigations (classifier guards, access controls) applied; risk “very low but not negligible”
  • CB-2: Threshold not crossed; model lacks novel scientific reasoning capabilities
  • Autonomy TM1: Applicable; separate risk assessment released; risk “very low, but higher than for previous models”
  • Autonomy TM2: Not applicable; the ECI trend bends upward (slope ratio of 1.86x–4.3x), but the bend is attributed to human research; this determination is held with less confidence than for any prior model
  • Anthropic acknowledges increasing reliance on subjective judgments rather than easy-to-interpret empirical results
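
The notes above do not say how the slope ratio reported for Autonomy TM2 is computed. As a rough illustration only, the sketch below assumes it is the ratio of a least-squares trend slope after a chosen breakpoint to the slope before it. The function names, the breakpoint, and the data points are all hypothetical and are not Anthropic's method or data.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_ratio(times, scores, breakpoint):
    """Slope of the trend at or after `breakpoint` divided by the slope at or before it.

    Assumption: this piecewise-linear definition is illustrative, not the
    definition used in the source report.
    """
    before = [(t, s) for t, s in zip(times, scores) if t <= breakpoint]
    after = [(t, s) for t, s in zip(times, scores) if t >= breakpoint]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two points on each side of the breakpoint")
    slope_before = fit_slope(*zip(*before))
    slope_after = fit_slope(*zip(*after))
    return slope_after / slope_before


if __name__ == "__main__":
    # Synthetic capability-index readings over time (made up for illustration).
    times = [0, 1, 2, 3, 4, 5]
    scores = [100, 102, 104, 106, 110, 114]
    print(slope_ratio(times, scores, breakpoint=3))
```

Under this definition the result is sensitive to where the breakpoint is placed, and a ratio above 1 says only that the trend steepened. It does not say what caused the change, which is the question the attribution to human research addresses.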