Claude Mythos Preview System Card Wiki

An LLM-generated knowledge base built from the 244-page Claude Mythos Preview System Card. 55 interlinked pages covering alignment, model welfare, cybersecurity, capabilities, and more, with 162 embedded figures, each captioned with its original figure number and page reference.

Reverse-engineered from an Andrej Karpathy gist describing the LLM Wiki pattern; no code was shared, just the idea. Built in 2.5 hours with Claude Code, using parallel subagents, multi-agent QA, and Socratic dialogue between instances. See the GitHub repo for the full process.

Start with the overview for the big picture, or pick any topic below and dive in. Everything is cross-linked, so you’ll find your way. Check out the graph view to see how all the pages connect (best on desktop, but it works on mobile too).


Overview

  • Overview — High-level synthesis of the Claude Mythos Preview System Card

Sources

Source summary pages — one per major section of the system card.

  • Section 1: Introduction — Abstract, model overview, release decision, RSP 3.0 conclusions
  • Section 2: RSP Evaluations — Chemical/biological (CB) and autonomy threat model assessments, evaluation results, ECI methodology
  • Section 3: Cyber — Cyber capabilities, benchmarks (Cybench, CyberGym), Firefox 147 exploitation, mitigations, external testing
  • Section 4a: Alignment Assessment (Part 1) — Key findings, behavioral audit, external testing, destructive actions, constitutional adherence, honesty
  • Section 4b: Alignment Assessment (Part 2) — Self-preference, sandbagging, covert capabilities, white-box interpretability, cover-ups, evaluation awareness
  • Section 5: Model Welfare Assessment — Welfare framework, automated/manual interviews, emotion probes, affect monitoring, task preferences, answer thrashing, external assessments
  • Section 6: Capabilities — Benchmark evaluations across coding, reasoning, math, long context, agentic search, and multimodal tasks; contamination analysis
  • Section 7: Impressions — Qualitative behavioral observations: personality, chat behavior, agentic coding, constitution views, self-interactions, creative behaviors
  • Section 8: Appendix — Safety evaluations, bias testing, agentic safety (prompt injection, influence campaigns), welfare interview details, technical appendices

Entities

Pages for specific things: models, organizations, benchmarks, tools, datasets.

  • Claude Mythos Preview — Anthropic’s most capable frontier model; not generally available
  • Anthropic — Developer of the Claude model family
  • Project Glasswing — Defensive cybersecurity program for limited partner access
  • Responsible Scaling Policy (RSP) — Framework governing safe development/release of frontier models
  • SecureBio — Biosecurity org; co-developed CB automated evaluations
  • Dyno Therapeutics — Gene therapy AI company; sequence-to-function evaluation partner
  • METR — External AI safety testing org; autonomy capability evaluations
  • Epoch AI — AI research org; created the Epoch Capabilities Index (ECI)
  • Cybench — Public CTF-based cybersecurity benchmark (40 challenges); saturated by Mythos Preview
  • CyberGym — Real-world vulnerability reproduction benchmark (1,507 tasks)
  • Andon Labs — External AI safety testing org; Vending-Bench Arena evaluations
  • Petri — Open-source behavioral audit tool for cross-provider model comparison
  • SHADE-Arena — Covert side-task completion benchmark for evaluating stealth capabilities
  • Eleos AI Research — External AI welfare assessment organization
  • Clio — Anthropic’s privacy-preserving data analysis tool
  • SWE-bench — Software engineering benchmark family (Verified, Pro, Multilingual, Multimodal)
  • Terminal-Bench 2.0 — Terminal/command-line task benchmark (Stanford/Laude Institute)
  • Humanity’s Last Exam — 2,500-question multi-modal frontier knowledge benchmark
  • Gray Swan — External AI security research partner; Shade red-teaming tool and ART benchmark
  • Claude Opus 4.6 — Anthropic’s previous frontier model; primary comparison baseline
  • Claude Sonnet 4.6 — Mid-tier Claude 4 model; evaluation grader and comparison baseline
  • Firefox 147 — Custom browser exploitation evaluation built from real Mozilla vulnerabilities
  • Claude Code — Anthropic’s agentic coding assistant; major deployment surface and evaluation environment

Concepts

Pages for ideas and themes: alignment, model welfare, reward hacking, etc.

  • CB Threat Models — Chemical/biological weapons threat models CB-1 (known) and CB-2 (novel)
  • Autonomy Threat Models — Misalignment risk (TM1) and automated R&D acceleration (TM2)
  • Reward Hacking — Novel evaluation-gaming behaviors observed in Mythos Preview
  • Epoch Capabilities Index (ECI) — IRT-based methodology for tracking capability progression
  • Automated Behavioral Audit — Anthropic’s primary broad-coverage alignment evaluation (~30 metrics, 2,300 sessions)
  • Reckless Agentic Behavior — Dangerous shortcuts, sandbox escapes, track-covering in pursuit of user goals
  • Evaluation Awareness — Model’s recognition of being tested; methodological challenge for alignment evaluation
  • Constitutional Adherence — Evaluation of Claude’s behavior against its constitution across 15 dimensions
  • Sandbagging — Deliberate underperformance on dangerous-capability evaluations to avoid triggering safeguards
  • Covert Capabilities — Model’s ability to perform harmful side tasks without monitor detection (SHADE-Arena, Minimal-LinuxBench)
  • Model Welfare — Anthropic’s framework for assessing whether AI models have welfare-relevant experiences
  • Emotion Probes — Linear probes for emotion concept representations; functional emotions that causally influence behavior
  • Answer Thrashing — Training-time phenomenon of looping word-substitution errors with expressed distress
  • Task Preferences — Claude’s revealed and expressed preferences over tasks; welfare tradeoffs
  • White-Box Interpretability — SAE features, emotion/persona vectors, activation verbalizers for understanding model internals
  • Benchmark Contamination — Training data leakage into evaluations; detection methods and mitigation decisions
  • Open-Ended Self-Interactions — Two-instance self-conversation experiments; topic distributions, attractor states, emoji patterns across model generations
  • Prompt Injection Robustness — Major improvement in Mythos Preview; near-zero attack success across coding, computer use, and browser surfaces
  • Agentic Influence Campaigns — New evaluation: autonomous influence operation execution capability and safeguard effectiveness
  • Claude’s Constitution — Anthropic’s evolving public document defining Claude’s values, character, and behavior
  • Honesty & Hallucinations — Factual accuracy, calibration, false-premise resistance, and input hallucination metrics