Terminal-Bench 2.0
An AI benchmark testing models on real-world terminal and command-line tasks, developed by researchers at Stanford University and the Laude Institute.
Results
Claude Mythos Preview achieved 82% mean reward under standard conditions (89 tasks × 5 attempts = 445 trials). Configuration: adaptive thinking at max effort, 1M total token budget per task, 32K max output tokens per request.
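The scoring described above is a simple average over all trials. A minimal sketch of that aggregation, using hypothetical reward data (the real harness records one reward per trial; the function name and example values here are illustrative assumptions, not the benchmark's actual code):

```python
# Sketch of mean-reward aggregation over 89 tasks x 5 attempts = 445 trials.
# Hypothetical helper; not the actual Terminal-Bench scoring code.
TASKS = 89
ATTEMPTS = 5

def mean_reward(rewards):
    """Average reward over all trials (tasks x attempts)."""
    assert len(rewards) == TASKS * ATTEMPTS  # 445 trials expected
    return sum(rewards) / len(rewards)

# Illustrative run: 365 of 445 trials succeed with reward 1.0, the rest 0.0.
rewards = [1.0] * 365 + [0.0] * 80
print(round(mean_reward(rewards), 3))  # 365/445 -> 0.82
```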
Runs were executed in the Harbor scaffold with the Terminus-2 harness and default parser. Each task runs in an isolated Kubernetes pod with resources guaranteed at 1× the benchmark-specified limits (hard preemption ceiling at 3×) and timeouts at 1×.
Caveats
- Harness differences: OpenAI used a specialized harness for their reported GPT-5.4 score (75.1%), making direct comparison inexact.
- Latency sensitivity: Fixed wall-clock timeouts mean slower-decoding endpoints complete fewer episodes, potentially hiding capability gains.
- Task ambiguities: Some tasks specify resource limits tight enough to constrain solution exploration; this is being addressed in the 2.1 update.
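The latency caveat above is just arithmetic: under a fixed wall-clock timeout, a slower-decoding endpoint fits fewer agent steps into each episode. A minimal illustration, where the timeout and per-step latencies are assumed numbers for the example, not measured benchmark values:

```python
# Illustration of the latency-sensitivity caveat: a fixed wall-clock timeout
# bounds how many agent steps an endpoint can take per episode.
# All numbers below are hypothetical.
TIMEOUT_S = 3600  # assumed 1-hour per-task timeout

def steps_within_timeout(seconds_per_step: float) -> int:
    """Whole agent steps that fit in the wall-clock budget."""
    return int(TIMEOUT_S // seconds_per_step)

fast, slow = 20.0, 60.0  # seconds per model step (hypothetical endpoints)
print(steps_within_timeout(fast))  # 180 steps
print(steps_within_timeout(slow))  # 60 steps: same capability, fewer chances
```

A model served at one-third the decoding speed gets one-third the steps before timeout, so an apparent score gap can reflect serving infrastructure rather than capability.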
Extended Configuration
To isolate agentic coding capability from infrastructure confounders, scores were re-run with the 2.1 fixes and 4-hour timeouts (roughly 4× the 2.0 baseline):
- Mythos Preview: 92.1%
- GPT-5.4 (Codex CLI harness): 75.3% (up from 68.3% under baseline)
- Gemini 3.1 Pro: not reported; previous results could not be reproduced.