HumanEval, MBPP, and SWE-bench combined.
Best AI models for coding.
Models are ranked by their published coding benchmark scores. SWE-bench (real bugs in open-source repositories) carries the most weight because it most closely predicts agent behaviour. HumanEval (function-level synthesis) and MBPP (small Python programs) cover baseline competence.
Benchmarks used:
HumanEval · 30%
MBPP · 20%
SWE-bench · 50%
| # | Model | Score | From |
|---|---|---|---|
| 1 | | 92.0 | Anthropic |
| 2 | | 92.0 | Mistral AI |
| 3 | | 91.7 | Alibaba (Qwen Team) |
| 4 | | 90.2 | OpenAI |
| 5 | | 90.0 | DeepSeek |
| 6 | | 89.0 | Meta AI |
| 7 | | 88.4 | xAI |
| 8 | | 88.4 | Meta AI |
| 9 | | 87.2 | OpenAI |
| 10 | | 86.6 | Alibaba (Qwen Team) |
| 11 | | 83.0 | Anthropic |
| 12 | | 83.0 | DeepSeek |
| 13 | | 82.6 | DeepSeek |
| 14 | | 82.6 | Anthropic |
| 15 | | 81.8 | OpenAI |
| 16 | | 81.0 | Google DeepMind |
| 17 | | 80.5 | Meta AI |
| 18 | | 80.0 | DeepSeek |
| 19 | | 79.8 | Anthropic |
| 20 | | 76.0 | Mistral AI |
| 21 | | 72.7 | Google DeepMind |
| 22 | | 72.6 | Meta AI |
| 23 | | 70.7 | Cohere |
| 24 | Kimi K2 (open) | 65.8 | Moonshot AI |
Showing top 24 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
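
As a rough illustration of how a weighted average like this could be computed, here is a minimal Python sketch using the 30/20/50 weights listed above. The function name `combined_score` and the dictionary keys are hypothetical. The handling of missing benchmarks (skipping them and renormalising the remaining weights) is an assumption; the page does not state how models with data on only some benchmarks are scored.

```python
from typing import Mapping, Optional

# Weights as listed under "Benchmarks used" (key names are illustrative).
WEIGHTS = {"humaneval": 0.30, "mbpp": 0.20, "swe_bench": 0.50}


def combined_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted average of the available benchmark scores on a 0-100 scale.

    ASSUMPTION: a benchmark with no published score is skipped and the
    remaining weights are renormalised so they sum to 1. The leaderboard
    itself may treat missing data differently.
    """
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None  # no published data on any benchmark
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Made-up scores for illustration only, not taken from the table:
# 0.30 * 90 + 0.20 * 80 + 0.50 * 60 = 73
example = combined_score({"humaneval": 90.0, "mbpp": 80.0, "swe_bench": 60.0})
print(round(example, 1))  # 73.0
```

Under this assumption, a model with only one published benchmark would receive that benchmark's score unchanged, which is worth keeping in mind when comparing rows that may rest on different subsets of the three benchmarks.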