Best AI models, by task.
Composite rankings from published benchmarks. Each board picks the right benchmark mix for one job — coding, reasoning, math, vision, knowledge, instruction-following, or quality-per-dollar.
Best AI models for coding
Models ranked on their published coding benchmarks. SWE-bench (real bugs in open-source repos) is weighted heaviest — it most closely predicts agent behaviour. HumanEval (functi...
Best AI models for reasoning
A composite of MMLU-Pro (broad knowledge under harder questions), GPQA Diamond (graduate-level science), and MATH (competition math) — the three benchmarks where reasoning skill...
Best AI models for math
MATH (competition-level problems graded on the final answer) weighted heaviest, GSM8K (grade-school word problems) as the floor. Models that win both handle algebra, calculus, and chain-of-...
Best AI models for general knowledge
MMLU measures breadth across 57 academic subjects; MMLU-Pro raises the bar on the same domains. A high score means the model knows a lot before it has to reason.
Best AI models for instruction-following
IFEval scores whether a model obeys explicit, checkable constraints such as word counts, JSON formats, and specific phrasings. Of the scores listed here, it is the one that most directly translates to production agent reliability.
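To make the idea of a checkable constraint concrete, here is a minimal sketch of the kinds of checks described above. It is not IFEval's actual implementation; the function names (`max_words`, `is_valid_json`, `contains_phrase`, `constraint_pass_rate`) are hypothetical and chosen only for illustration.

```python
import json
from typing import Callable, List


def max_words(limit: int) -> Callable[[str], bool]:
    """Build a check that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def is_valid_json(response: str) -> bool:
    """Pass when the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def contains_phrase(phrase: str) -> Callable[[str], bool]:
    """Build a check that passes when the response contains `phrase` verbatim."""
    return lambda response: phrase in response


def constraint_pass_rate(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of checks the response satisfies (0.0 when no checks are given)."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)


# Hypothetical response and constraint set.
response = '{"answer": "yes"}'
checks = [is_valid_json, max_words(5), contains_phrase("no")]
rate = constraint_pass_rate(response, checks)  # two of the three checks pass
```

Each check is deterministic, which is what makes this style of evaluation reproducible without a human or model judge.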
Best AI models for vision
MMMU evaluates models on college-level questions paired with diagrams, charts, and images. Scores are sourced from each model's official MMMU submission.
Cheapest capable AI models
A composite of MMLU and HumanEval scores divided by the API price per million input tokens. Frontier models are priced at a premium; this list surfaces the cheapest options that still hold up on the basics.
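The quality-per-dollar ratio can be sketched as follows. This is an illustration under stated assumptions, not the site's actual formula: the page does not say how MMLU and HumanEval are combined, so an equal-weight mean is assumed, and the function name `value_score`, the model names, and all numbers are hypothetical.

```python
def value_score(mmlu: float, humaneval: float, input_price_per_mtok: float) -> float:
    """Composite benchmark score (0-100) per USD of per-million input-token price."""
    if input_price_per_mtok <= 0:
        raise ValueError("price must be positive")
    composite = (mmlu + humaneval) / 2  # assumption: equal weights
    return composite / input_price_per_mtok


# Hypothetical models: (MMLU, HumanEval, USD per million input tokens).
models = {
    "model-a": (88.0, 90.0, 10.0),
    "model-b": (70.0, 74.0, 0.5),
}

# Rank from best to worst value.
ranked = sorted(models, key=lambda name: value_score(*models[name]), reverse=True)
```

With these made-up numbers the cheaper model ranks first despite lower raw scores, which is the behaviour the board is designed to surface. A ratio like this ignores output-token pricing, so it is only a rough proxy for real workload cost.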