Humanity’s Last Exam
Humanity’s Last Exam (HLE) is described as “a multi-modal benchmark at the frontier of human knowledge” and comprises 2,500 questions. It is published at lastexam.ai.
Evaluation Setup
Claude Mythos Preview was tested in two configurations:
- No tools (reasoning only): 56.8%
- With tools (web search, web fetch, programmatic tool calling, code execution, and context compaction every 50k tokens, up to 3M tokens total): 64.7%
Claude Opus 4.6 served as the model grader.
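The token budget in the with-tools configuration can be pictured with a small bookkeeping sketch. The source gives only the two numbers (50k and 3M); the class name, the trigger logic, and the action labels below are assumptions for illustration, not a description of the actual harness.

```python
from dataclasses import dataclass


@dataclass
class CompactionBudget:
    """Hypothetical tracker for a compaction schedule (illustrative only)."""

    compact_every: int = 50_000    # tokens between compactions (figure from the setup)
    total_limit: int = 3_000_000   # overall token cap (figure from the setup)
    used_total: int = 0            # tokens consumed across the whole run
    since_compaction: int = 0      # tokens consumed since the last compaction
    compactions: int = 0           # number of compactions triggered so far

    def record(self, tokens: int) -> str:
        """Record tokens used by one step and return the next action."""
        self.used_total += tokens
        self.since_compaction += tokens
        if self.used_total >= self.total_limit:
            return "stop"          # total budget exhausted
        if self.since_compaction >= self.compact_every:
            self.since_compaction = 0
            self.compactions += 1
            return "compact"       # summarize context before continuing
        return "continue"
```

Under this reading, a run that never stops early would compact at most about 60 times (3M / 50k) before reaching the cap.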
Anti-Contamination Measures
For the with-tools configuration, sources known to discuss HLE are blocklisted for both the web search and web fetch tools (see Appendix 8.5). Claude Opus 4.6 then reviews every transcript and flags any that appear to have retrieved answers from HLE-specific sources; confirmed cases are re-graded as incorrect.
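The two safeguards above (a blocklist applied before retrieval, and a re-grade after transcript review) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the domain `blocked.example`, the function names, and the host-matching rule are all hypothetical, and the actual blocklist is the one given in the appendix.

```python
from urllib.parse import urlparse

# Hypothetical placeholder entry; the real blocklist is not reproduced here.
BLOCKLIST = {"blocked.example"}


def is_blocked(url: str, blocklist=BLOCKLIST) -> bool:
    """Return True if the URL's host is a blocklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)


def filter_results(urls, blocklist=BLOCKLIST):
    """Drop blocklisted URLs; assumed to run for both search results and fetch requests."""
    return [u for u in urls if not is_blocked(u, blocklist)]


def final_grade(graded_correct: bool, contamination_confirmed: bool) -> bool:
    """A confirmed contamination case counts as incorrect regardless of the grade."""
    return graded_correct and not contamination_confirmed
```

For example, `filter_results(["https://blocked.example/q1", "https://ok.example/paper"])` keeps only the second URL, and `final_grade(True, True)` is `False`.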
Comparison
| Model | No tools | With tools |
|---|---|---|
| Mythos Preview | 56.8% | 64.7% |
| Claude Opus 4.6 | 40.0% | 53.1% |
| GPT-5.4 | 39.8% | 52.1% |
| Gemini 3.1 Pro | 44.4% | 51.4% |
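
To make the effect of tool access easier to compare across models, the gain in percentage points can be computed directly from the table. The snippet below is only arithmetic on the reported scores; the variable names are illustrative.

```python
# (no tools, with tools) scores in percent, copied from the table above
scores = {
    "Mythos Preview": (56.8, 64.7),
    "Claude Opus 4.6": (40.0, 53.1),
    "GPT-5.4": (39.8, 52.1),
    "Gemini 3.1 Pro": (44.4, 51.4),
}

# Gain from tools in percentage points, rounded to one decimal place
gains = {
    model: round(with_tools - no_tools, 1)
    for model, (no_tools, with_tools) in scores.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain} points")
```

This gives gains of 7.9, 13.1, 12.3, and 7.0 points respectively, so the models with lower no-tools scores (Claude Opus 4.6 and GPT-5.4) gain the most from tools, while Mythos Preview leads in both columns.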