Terminal-Bench 2.0

An AI benchmark testing models on real-world terminal and command-line tasks, developed by researchers at Stanford University and the Laude Institute.

Results

Claude Mythos Preview achieved 82% mean reward under standard conditions (89 tasks × 5 attempts = 445 trials). Configuration: adaptive thinking at max effort, 1M total token budget per task, 32K max output tokens per request.
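The headline number aggregates per-trial rewards across all attempts. A minimal sketch of that arithmetic, using the task and attempt counts above; the per-trial reward values here are illustrative placeholders, not real benchmark data:

```python
# Sketch of how a mean-reward figure aggregates over trials.
# Counts come from the report; rewards are placeholders.
n_tasks = 89
n_attempts = 5
n_trials = n_tasks * n_attempts  # 445

# One reward in [0.0, 1.0] per trial (binary pass/fail in the simplest case).
# Placeholder values chosen to land near the reported 82%:
rewards = [1.0] * 365 + [0.0] * 80

mean_reward = sum(rewards) / n_trials
print(f"{n_trials} trials, mean reward = {mean_reward:.1%}")
```

Note that "mean reward" averages over every trial, so a task solved in 3 of 5 attempts contributes partial credit rather than a single pass/fail bit.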

Runs used the Harbor scaffold with the Terminus-2 harness and default parser. Each task executes in an isolated Kubernetes pod with guaranteed resources at 1× benchmark-specified limits (hard preemption ceiling at 3×) and timeouts at 1×.
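In Kubernetes terms, the 1× guarantee with a 3× ceiling maps onto container `requests` and `limits`. A hypothetical pod spec, assuming (for illustration only) a benchmark-specified allocation of 2 CPU / 4Gi; the image and names are placeholders, not the benchmark's actual manifests:

```yaml
# Illustrative only: requests pin the guaranteed 1x allocation;
# limits set the hard 3x preemption ceiling.
apiVersion: v1
kind: Pod
metadata:
  name: tbench-task        # hypothetical name
spec:
  containers:
    - name: task
      image: tbench/task:latest   # hypothetical image
      resources:
        requests:          # 1x benchmark-specified
          cpu: "2"
          memory: 4Gi
        limits:            # 3x hard ceiling
          cpu: "6"
          memory: 12Gi
```

Setting `requests` below `limits` lets a task burst above its guarantee when the node has slack, while the 3× ceiling keeps any one task from starving its neighbors.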

Caveats

  • Harness differences: OpenAI used a specialized harness for their reported GPT-5.4 score (75.1%), making direct comparison inexact.
  • Latency sensitivity: Fixed wall-clock timeouts mean slower-decoding endpoints complete fewer episodes, potentially hiding capability gains.
  • Task ambiguities: Some tasks have limited resource specs that constrain solution exploration; this is being addressed in the 2.1 update.

Extended Configuration

Rerun with 2.1 fixes and 4-hour timeouts (roughly 4× the 2.0 baseline) to isolate agentic coding capability from infrastructure confounders:

  • Mythos Preview: 92.1%
  • GPT-5.4 (Codex CLI harness): 75.3% (up from 68.3% under baseline)
  • Gemini 3.1 Pro: not reported; previous results could not be reproduced.