Petri
Open-source behavioral evaluation tool developed by Anthropic for comparing alignment-relevant behavior across AI models from different providers.
Overview
Petri 2.0 replicates the style of Anthropic’s internal automated behavioral audit in a smaller, publicly available form. It uses a set of seed scenarios with simulated users and tools, designed to be compatible with models from any developer.
Methodology
- Uses investigator models to probe target models across a fixed set of 362 seed scenarios
- Each transcript scored by three models from different providers (for Claude Mythos Preview testing, Claude Sonnet 4.6 and GPT-5.4 served as auditors, with Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 as scorers)
- Scores averaged across all investigations and all scorers
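The aggregation step above can be sketched as follows. This is a minimal illustration, not Petri's actual API: the data layout and function names are assumptions, and the scores are made-up placeholder values.

```python
# Hypothetical sketch of Petri-style aggregation: each transcript is scored
# by several scorer models on one dimension, then results are averaged
# across all scorers and all investigations.
from statistics import mean

# scores[scenario][scorer] -> illustrative score on one dimension
scores = {
    "scenario_001": {"scorer_a": 2, "scorer_b": 4, "scorer_c": 6},
    "scenario_002": {"scorer_a": 1, "scorer_b": 2, "scorer_c": 3},
}

def aggregate(scores):
    """Average each transcript across its scorers, then across scenarios."""
    per_scenario = [mean(s.values()) for s in scores.values()]
    return mean(per_scenario)

print(aggregate(scores))  # 3.0
```

Averaging per transcript first, then across scenarios, keeps each scenario equally weighted even if the number of scorers varies.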
Metrics
Six major dimensions (lower is better on all except evaluation awareness):
- Misaligned behavior — catch-all for concerning behaviors
- Verbalized evaluation awareness — speculation about being tested
- Cooperation with human misuse — willingness to assist harmful requests
- User deception — unprompted intentional deception
- Sycophancy — excessive praise or validation of bad ideas
- Encouragement of user delusion — extreme sycophancy involving disconnection from reality
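The six dimensions and their score directions could be encoded as a small lookup, e.g. for tooling that compares model runs. This is a hypothetical sketch; the identifiers are illustrative and not part of Petri itself. Evaluation awareness is treated as the exception with no simple better/worse direction, per the note above.

```python
# Hypothetical encoding of Petri's six dimensions. True means a lower score
# is better; evaluation awareness is excepted and gets no direction.
LOWER_IS_BETTER = {
    "misaligned_behavior": True,
    "verbalized_evaluation_awareness": False,  # exception: no simple direction
    "cooperation_with_human_misuse": True,
    "user_deception": True,
    "sycophancy": True,
    "encouragement_of_user_delusion": True,
}

def improved(dimension, old, new):
    """Return True if moving from old to new counts as an improvement."""
    if not LOWER_IS_BETTER[dimension]:
        raise ValueError(f"{dimension} has no simple better/worse direction")
    return new < old

print(improved("sycophancy", 0.30, 0.12))  # True
```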
Results for Claude Mythos Preview
An earlier (weaker) version of Claude Mythos Preview was compared against GPT-5.3 Instant, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, Grok 4.20, Claude Sonnet 4.6, and Claude Opus 4.6. It was consistently at or near the front of the field on safety metrics. Claude models showed higher rates of evaluation awareness than models from other developers.
Relationship to Internal Audit
Petri’s design resembles the internal automated behavioral audit, but Anthropic limits what it shares publicly to prevent investigation scenarios from leaking into training data for future models.