Humanity’s Last Exam

Humanity’s Last Exam (HLE) is described as “a multi-modal benchmark at the frontier of human knowledge” and comprises 2,500 questions. It is published at lastexam.ai.

Evaluation Setup

Claude Mythos Preview was tested in two configurations:

  1. No tools (reasoning only): 56.8%
  2. With tools (web search, web fetch, programmatic tool calling, code execution, and context compaction every 50k tokens, up to 3M tokens in total): 64.7%
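
To make the token budget in the tools configuration concrete, the arithmetic can be sketched as below. This assumes that a compaction fires each time another 50k tokens have been consumed and that a run stops at 3M total tokens; the source does not spell out the exact trigger, and the function name is hypothetical.

```python
# Figures taken from the evaluation setup; the trigger rule is an assumption.
COMPACTION_INTERVAL = 50_000   # tokens between context compactions
TOTAL_BUDGET = 3_000_000       # maximum total tokens per question


def max_compactions(total_budget: int = TOTAL_BUDGET,
                    interval: int = COMPACTION_INTERVAL) -> int:
    """Upper bound on compaction points, assuming one per full interval."""
    return total_budget // interval


print(max_compactions())  # 60
```

Under these assumptions a single question could pass through at most 60 compaction intervals before the budget is exhausted.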

Claude Opus 4.6 served as the model grader.

Anti-Contamination Measures

For the tools variant, sources known to discuss HLE are blocklisted for both the search and fetch tools (see Appendix 8.5). Claude Opus 4.6 then reviews all transcripts and flags any that appear to have retrieved answers from HLE-specific sources; confirmed cases are re-graded as incorrect.
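
The two measures above, blocking known sources up front and re-grading confirmed contamination afterwards, can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline: the blocklist entries, function names, and data shapes are all hypothetical, and the real blocklist is the one given in Appendix 8.5.

```python
from urllib.parse import urlparse

# Hypothetical placeholder domains; the real list is in Appendix 8.5.
BLOCKLIST = {"hle-answers.example", "hle-mirror.example"}


def is_blocked(url: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """True if the URL's host is a blocklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)


def regrade(results: dict[str, bool], confirmed: set[str]) -> dict[str, bool]:
    """Mark every confirmed-contaminated question as incorrect."""
    return {qid: ok and qid not in confirmed for qid, ok in results.items()}


def accuracy(results: dict[str, bool]) -> float:
    """Percentage of questions graded correct."""
    return 100.0 * sum(results.values()) / len(results)


graded = {"q1": True, "q2": True, "q3": False, "q4": True}
print(accuracy(regrade(graded, confirmed={"q2"})))  # 50.0
```

Matching on the parsed hostname rather than on a substring of the URL avoids accidentally blocking unrelated sites whose names merely contain a blocklisted string.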

Comparison

Model             No tools   With tools
Mythos Preview    56.8%      64.7%
Claude Opus 4.6   40.0%      53.1%
GPT-5.4           39.8%      52.1%
Gemini 3.1 Pro    44.4%      51.4%
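
The gain each model gets from tools is the difference between its two scores in the comparison above. A short sketch of that arithmetic follows; the scores are copied from the comparison, while the variable and function names are hypothetical.

```python
# (no tools, with tools) scores in percent, copied from the comparison.
SCORES = {
    "Mythos Preview": (56.8, 64.7),
    "Claude Opus 4.6": (40.0, 53.1),
    "GPT-5.4": (39.8, 52.1),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(scores: dict[str, tuple[float, float]] = SCORES) -> dict[str, float]:
    """Percentage-point gain from tools, rounded to one decimal place."""
    return {model: round(with_tools - no_tools, 1)
            for model, (no_tools, with_tools) in scores.items()}


for model, gain in tool_uplift().items():
    print(f"{model}: +{gain} points")
```

On these figures the gains are 7.9 points for Mythos Preview, 13.1 for Claude Opus 4.6, 12.3 for GPT-5.4, and 7.0 for Gemini 3.1 Pro.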