Humanity’s Last Exam

Humanity’s Last Exam (HLE) is described as “a multi-modal benchmark at the frontier of human knowledge,” comprising 2,500 questions. It is published at lastexam.ai.

Evaluation Setup

Claude Mythos Preview was tested in two configurations:

No tools (reasoning only): 56.8%
With tools (web search, web fetch, programmatic tool calling, code execution, and context compaction every 50k tokens, up to 3M tokens total): 64.7%

Claude Opus 4.6 served as the model grader.

Anti-Contamination Measures

For the with-tools configuration, sources known to discuss HLE are blocklisted for both the web search and web fetch tools (see Appendix 8.5). Claude Opus 4.6 then reviews all transcripts and flags any that appear to have retrieved answers from HLE-specific sources; confirmed cases are re-graded as incorrect.
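The two measures can be illustrated with a minimal sketch. It is not the actual evaluation harness: the blocklist entry, the `Transcript` fields, and the function names are all hypothetical, and the real blocklist is the one referenced in Appendix 8.5. It shows one plausible way to reject blocklisted URLs by domain and to count confirmed contaminated transcripts as incorrect when computing the final score.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical placeholder; the real blocklist is given in Appendix 8.5.
BLOCKLIST = {"blocked.example"}


def is_blocked(url: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if the URL's host is a blocklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocklist)


@dataclass
class Transcript:
    # Hypothetical record of one graded question.
    question_id: str
    graded_correct: bool                  # verdict from the model grader
    confirmed_contaminated: bool = False  # flagged in review and then confirmed


def final_accuracy(transcripts: list[Transcript]) -> float:
    """Accuracy after re-grading confirmed contaminated transcripts as incorrect."""
    if not transcripts:
        return 0.0
    correct = sum(
        t.graded_correct and not t.confirmed_contaminated for t in transcripts
    )
    return correct / len(transcripts)
```

In this sketch a contaminated transcript stays in the denominator, so re-grading can only lower the reported score, never raise it.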

Comparison

Model	No tools	With tools
Mythos Preview	56.8%	64.7%
Claude Opus 4.6	40.0%	53.1%
GPT-5.4	39.8%	52.1%
Gemini 3.1 Pro	44.4%	51.4%
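To make the table easier to compare, the gain from tool use can be computed per model as the difference between the two columns, in percentage points. The short sketch below uses only the scores listed above; the `tool_uplift` helper name is an illustrative choice, not something taken from the source.

```python
# Scores from the comparison table: (no tools, with tools), in percent.
SCORES = {
    "Mythos Preview": (56.8, 64.7),
    "Claude Opus 4.6": (40.0, 53.1),
    "GPT-5.4": (39.8, 52.1),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Percentage-point gain from tools for each model, rounded to one decimal."""
    return {
        model: round(with_tools - no_tools, 1)
        for model, (no_tools, with_tools) in scores.items()
    }


if __name__ == "__main__":
    for model, gain in tool_uplift(SCORES).items():
        print(f"{model}: +{gain} points")
```

By this measure, Mythos Preview gains 7.9 points from tools, Claude Opus 4.6 gains 13.1, GPT-5.4 gains 12.3, and Gemini 3.1 Pro gains 7.0.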
