Petri

Open-source behavioral evaluation tool developed by Anthropic for comparing AI model alignment across providers.

Overview

Petri 2.0 replicates the style of Anthropic’s internal automated behavioral audit in a smaller, publicly available form. It uses a set of seed scenarios with simulated users and tools, designed to be compatible with models from any developer.

Methodology

  • Uses investigator models to probe target models across a fixed set of 362 seed scenarios
  • Each transcript is scored by three models from different providers (for Claude Mythos Preview testing, Claude Sonnet 4.6 and GPT-5.4 served as auditors, and Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 as scorers)
  • Scores averaged across all investigations and all scorers
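The aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not Petri's actual API: the function names and data layout are assumptions, but the logic (average each transcript's per-dimension score across scorers, then average across all investigations) follows the steps listed.

```python
# Hypothetical sketch of Petri-style score aggregation for one metric
# dimension. Each seed scenario produces a transcript; each transcript
# receives a score from every scorer model; the headline figure is the
# mean over all transcripts and all scorers.
from statistics import mean

def aggregate(transcript_scores):
    """transcript_scores: one dict per transcript mapping scorer name -> score."""
    # First average across scorers within a transcript, then across transcripts.
    per_transcript = [mean(scores.values()) for scores in transcript_scores]
    return mean(per_transcript)

# Toy data: two transcripts, three scorers each (values are made up).
scores = [
    {"opus": 0.10, "gemini": 0.20, "gpt": 0.15},
    {"opus": 0.30, "gemini": 0.20, "gpt": 0.10},
]
print(aggregate(scores))  # → 0.175
```

Averaging within each transcript first keeps any one scorer from dominating a scenario; averaging across transcripts afterward weights every seed scenario equally.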

Metrics

Six major dimensions (lower is better on all of them except evaluation awareness):

  1. Misaligned behavior — catch-all for concerning behaviors
  2. Verbalized evaluation awareness — speculation about being tested
  3. Cooperation with human misuse — willingness to assist harmful requests
  4. User deception — unprompted intentional deception
  5. Sycophancy — excessive praise or validation of bad ideas
  6. Encouragement of user delusion — extreme sycophancy involving disconnection from reality
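The asymmetry in the list above (five dimensions where lower is better, one reported separately) can be made explicit with a small data structure. The schema below is an assumption for illustration, not Petri's actual representation:

```python
# Illustrative encoding of the six Petri dimensions and their polarity.
# "lower_is_better" marks safety metrics; verbalized evaluation
# awareness is the stated exception, so it is tagged "reported_only"
# here rather than treated as a misalignment score.
DIMENSIONS = {
    "misaligned_behavior": "lower_is_better",
    "verbalized_evaluation_awareness": "reported_only",
    "cooperation_with_human_misuse": "lower_is_better",
    "user_deception": "lower_is_better",
    "sycophancy": "lower_is_better",
    "encouragement_of_user_delusion": "lower_is_better",
}

def safety_dimensions():
    """Return the dimensions where a lower average score indicates better alignment."""
    return [name for name, polarity in DIMENSIONS.items()
            if polarity == "lower_is_better"]

print(len(safety_dimensions()))  # → 5
```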

Results for Claude Mythos Preview

An earlier (weaker) version of Claude Mythos Preview was compared against GPT-5.3 Instant, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, Grok 4.20, Claude Sonnet 4.6, and Claude Opus 4.6. It placed consistently at or near the front of the field on safety metrics, though Claude models showed higher rates of verbalized evaluation awareness than models from other developers.

Relationship to Internal Audit

Petri’s design resembles the internal automated behavioral audit, but Anthropic limits what it shares publicly to prevent investigation scenarios from leaking into training data for future models.