MMLU-Pro, GPQA Diamond, and MATH.

Best AI models for reasoning.

A composite of MMLU-Pro (broad knowledge, tested with harder questions), GPQA Diamond (graduate-level science), and MATH (competition math), the three benchmarks where reasoning skill matters most.

Benchmarks used: MMLU-Pro (40%) · GPQA Diamond (40%) · MATH (20%)

Showing top 22 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
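
As a rough illustration of how a weighted average with these weights could be computed, here is a minimal sketch. The weights come from the list above. The function name `composite_score`, the example scores, and the handling of missing benchmarks are assumptions: the page does not say how a model with data on only one or two benchmarks is scored, so this sketch simply renormalizes the weights over the benchmarks that are present.

```python
# Benchmark weights in percent, as listed on this page.
WEIGHTS = {"MMLU-Pro": 40, "GPQA Diamond": 40, "MATH": 20}


def composite_score(scores):
    """Return a weighted average on a 0-100 scale.

    `scores` maps benchmark name -> score (0-100), or None if unpublished.
    ASSUMPTION: benchmarks without a published score are skipped and the
    remaining weights are renormalized. The leaderboard's actual rule for
    missing data is not stated in the source.
    """
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        raise ValueError("no published score on any weighted benchmark")

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Hypothetical scores, for illustration only (not real model results).
full = composite_score({"MMLU-Pro": 80, "GPQA Diamond": 60, "MATH": 90})
partial = composite_score({"MMLU-Pro": 80, "GPQA Diamond": 60, "MATH": None})
print(full, partial)
```

With all three scores present, the result is the plain 40/40/20 weighted average. Under the renormalization assumption, a model missing MATH is averaged over MMLU-Pro and GPQA Diamond alone, which is why composites for models with partial data may not be directly comparable to those with full data.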
