AI model leaderboards

Best AI models, by task.

Composite rankings from published benchmarks. Each board picks the right benchmark mix for one job — coding, reasoning, math, vision, knowledge, instruction-following, or quality-per-dollar.

HumanEval, MBPP, and SWE-bench combined.

Best AI models for coding

Models ranked on their published coding benchmarks. SWE-bench (real bugs in open-source repos) is weighted heaviest — it most closely predicts agent behaviour. HumanEval (functi...

Benchmarks: HumanEval, MBPP, SWE-bench

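The weighting described for the coding board can be sketched as a weighted mean. This is only an illustration: the source says SWE-bench is weighted heaviest but gives no numbers, so the weights below, the `coding_composite` name, and the choice to renormalise over whichever benchmarks a model has published are all assumptions.

```python
# Hypothetical weights. The page states only that SWE-bench counts most;
# the actual values used by the leaderboard are not given.
WEIGHTS = {"swe_bench": 0.5, "humaneval": 0.25, "mbpp": 0.25}


def coding_composite(scores, weights=WEIGHTS):
    """Weighted mean of published coding scores (each on a 0-100 scale).

    Benchmarks a model has not published are skipped and the remaining
    weights are renormalised (an assumption, not the site's stated method).
    """
    available = {name: w for name, w in weights.items() if name in scores}
    if not available:
        raise ValueError("no published coding benchmark scores")
    total_weight = sum(available.values())
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Made-up scores, for illustration only.
example = {"swe_bench": 40.0, "humaneval": 90.0, "mbpp": 80.0}
print(coding_composite(example))
```

With these assumed weights, a model strong on HumanEval and MBPP but weak on SWE-bench lands well below its function-level scores, which is the effect the board's description implies.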
MMLU-Pro, GPQA Diamond, and MATH.

Best AI models for reasoning

A composite of MMLU-Pro (broad knowledge under harder questions), GPQA Diamond (graduate-level science), and MATH (competition math) — the three benchmarks where reasoning skill...

Benchmarks: MMLU-Pro, GPQA, MATH

MATH and GSM8K.

Best AI models for math

MATH (competition-level problems) is weighted heaviest, with GSM8K (grade-school word problems) as the floor. Models that win both handle algebra, calculus, and chain-of-...

Benchmarks: MATH, GSM8K

MMLU and MMLU-Pro combined.

Best AI models for general knowledge

MMLU measures breadth across 57 academic subjects; MMLU-Pro raises the bar on the same domains. A high score means the model knows a lot before it has to reason.

Benchmarks: MMLU, MMLU-Pro

IFEval — does it actually do what you ask?

Best AI models for instruction-following

IFEval scores whether a model obeys explicit, checkable constraints such as word counts, JSON formats, and specific phrasings. Of the scores here, it is the one most directly tied to agent reliability in production.
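To make "obeys constraints" concrete, here is a minimal sketch of the kind of programmatic check this implies. It is not IFEval's own code; the `check_constraints` and `pass_rate` helpers and their parameters are hypothetical names chosen for illustration.

```python
import json


def check_constraints(response, max_words=None, must_be_json=False,
                      required_phrase=None):
    """Return a dict mapping each requested constraint to True/False."""
    results = {}
    if max_words is not None:
        # Whitespace-separated tokens as a simple word count.
        results["max_words"] = len(response.split()) <= max_words
    if must_be_json:
        try:
            json.loads(response)
            results["json"] = True
        except json.JSONDecodeError:
            results["json"] = False
    if required_phrase is not None:
        results["required_phrase"] = required_phrase in response
    return results


def pass_rate(results):
    """Fraction of requested constraints satisfied (1.0 if none requested)."""
    return sum(results.values()) / len(results) if results else 1.0


checks = check_constraints('{"ok": true}', max_words=5, must_be_json=True,
                           required_phrase="ok")
print(checks, pass_rate(checks))
```

Checks like these are attractive for agent pipelines because they are deterministic: a response either parses as JSON or it does not, with no judge model involved.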

Benchmarks: IFEval

MMMU — multimodal reasoning across images.

Best AI models for vision

MMMU evaluates models on college-level questions paired with diagrams, charts, and images. Sourced from each model's official MMMU submission.

Benchmarks: MMMU

Quality per dollar.

Cheapest capable AI models

A composite of MMLU and HumanEval scores, divided by the API price per million input tokens. Frontier models are expensive; this list surfaces the cheapest options that still hold up on the basics.
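The quality-per-dollar ratio can be sketched as follows. The source does not say how MMLU and HumanEval are combined, so the unweighted mean here is an assumption, and the model names, scores, and prices are made up for illustration.

```python
def quality_per_dollar(mmlu, humaneval, input_price_per_million):
    """Composite score divided by USD price per million input tokens."""
    if input_price_per_million <= 0:
        raise ValueError("price must be positive")
    # Assumption: the composite is an unweighted mean of the two scores.
    composite = (mmlu + humaneval) / 2
    return composite / input_price_per_million


# Hypothetical models: (MMLU, HumanEval, USD per million input tokens).
models = {
    "frontier-model": (88.0, 90.0, 10.0),
    "budget-model": (80.0, 70.0, 0.5),
}

ranking = sorted(
    models,
    key=lambda name: quality_per_dollar(*models[name]),
    reverse=True,
)
print(ranking)
```

Because price sits in the denominator, a modest drop in benchmark scores is easily outweighed by a large drop in price, so cheaper models tend to dominate a ranking like this.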

Benchmarks: MMLU, HumanEval (price-weighted)