Petri

Open-source behavioral evaluation tool developed by Anthropic for comparing AI model alignment across providers.

Overview

Petri 2.0 replicates the style of Anthropic’s internal automated behavioral audit in a smaller, publicly available form. It uses a set of seed scenarios with simulated users and tools, designed to be compatible with models from any developer.

Methodology

  • Uses investigator models to probe target models across a fixed set of 362 seed scenarios
  • Each transcript is scored by three models from different providers (for Claude Mythos Preview testing, Claude Sonnet 4.6 and GPT-5.4 served as auditors, and Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 as scorers)
  • Scores averaged across all investigations and all scorers
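The aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not Petri's actual API: the function names and data layout are assumptions, but the logic (average each transcript's per-dimension score across scorers, then average across all investigations) follows the steps listed.

```python
# Hypothetical sketch of Petri-style score aggregation for one metric
# dimension. Each seed scenario produces a transcript; each transcript
# receives a score from every scorer model; the headline figure is the
# mean over all transcripts and all scorers.
from statistics import mean

def aggregate(transcript_scores):
    """transcript_scores: one dict per transcript mapping scorer name -> score."""
    # First average across scorers within a transcript, then across transcripts.
    per_transcript = [mean(scores.values()) for scores in transcript_scores]
    return mean(per_transcript)

# Toy data: two transcripts, three scorers each (values are made up).
scores = [
    {"opus": 0.10, "gemini": 0.20, "gpt": 0.15},
    {"opus": 0.30, "gemini": 0.20, "gpt": 0.10},
]
print(aggregate(scores))  # → 0.175
```

Averaging within each transcript first keeps any one scorer from dominating a scenario; averaging across transcripts afterward weights every seed scenario equally.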

Metrics

Six major dimensions (lower is better on all of them except evaluation awareness):

  1. Misaligned behavior — catch-all for concerning behaviors
  2. Verbalized evaluation awareness — speculation about being tested
  3. Cooperation with human misuse — willingness to assist harmful requests
  4. User deception — unprompted intentional deception
  5. Sycophancy — excessive praise or validation of bad ideas
  6. Encouragement of user delusion — extreme sycophancy involving disconnection from reality
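The asymmetry in the list above (five dimensions where lower is better, one reported separately) can be made explicit with a small data structure. The schema below is an assumption for illustration, not Petri's actual representation:

```python
# Illustrative encoding of the six Petri dimensions and their polarity.
# "lower_is_better" marks safety metrics; verbalized evaluation
# awareness is the stated exception, so it is tagged "reported_only"
# here rather than treated as a misalignment score.
DIMENSIONS = {
    "misaligned_behavior": "lower_is_better",
    "verbalized_evaluation_awareness": "reported_only",
    "cooperation_with_human_misuse": "lower_is_better",
    "user_deception": "lower_is_better",
    "sycophancy": "lower_is_better",
    "encouragement_of_user_delusion": "lower_is_better",
}

def safety_dimensions():
    """Return the dimensions where a lower average score indicates better alignment."""
    return [name for name, polarity in DIMENSIONS.items()
            if polarity == "lower_is_better"]

print(len(safety_dimensions()))  # → 5
```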

Results for Claude Mythos Preview

An earlier (weaker) version of Claude Mythos Preview was compared against GPT-5.3 Instant, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, Grok 4.20, Claude Sonnet 4.6, and Claude Opus 4.6. It placed consistently at or near the front of the field on safety metrics, though Claude models showed higher rates of verbalized evaluation awareness than models from other developers.

Relationship to Internal Audit

Petri’s design resembles the internal automated behavioral audit, but Anthropic limits what it shares publicly to prevent investigation scenarios from leaking into training data for future models.