HumanEval, MBPP, and SWE-bench combined.

Best AI models for coding.

Models are ranked on their published coding benchmark scores. SWE-bench (real bugs in open-source repos) is weighted most heavily because it most closely predicts agent behaviour. HumanEval (function-level synthesis) and MBPP (small Python programs) cover baseline competence.

Benchmarks used: HumanEval (30%) · MBPP (20%) · SWE-bench (50%)

Showing top 24 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
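
As a rough illustration of how a weighted score like this could be computed, here is a minimal sketch using the weights listed above. The page does not say how a model with results on only some benchmarks is handled; this sketch assumes the weights are renormalised over the benchmarks that have data, which may not match the leaderboard's actual method. The function name `weighted_score` and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional

# Weights as listed under "Benchmarks used".
WEIGHTS: Dict[str, float] = {"humaneval": 0.30, "mbpp": 0.20, "swe_bench": 0.50}


def weighted_score(scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Return a 0-100 weighted average of the available benchmark scores.

    ASSUMPTION: missing benchmarks are skipped and the remaining weights
    are renormalised so they sum to 1. The source does not confirm this.
    """
    # Keep only known benchmarks that actually have a published score.
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None  # no published data, so the model cannot be ranked

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Illustrative inputs only, not real model results.
full = weighted_score({"humaneval": 90.0, "mbpp": 80.0, "swe_bench": 60.0})
partial = weighted_score({"humaneval": None, "mbpp": None, "swe_bench": 60.0})
```

With all three scores present, the result is simply 30% of HumanEval plus 20% of MBPP plus 50% of SWE-bench. Under the renormalisation assumption, a model with only a SWE-bench result is scored on that result alone.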
