HumanEval, MBPP, and SWE-bench combined.
Best AI models for coding.
Models ranked by their published coding benchmark scores. SWE-bench (real bugs in open-source repos) carries the most weight, since it predicts how a model behaves as an agent. HumanEval (function synthesis) and MBPP (small Python programs) cover the basics.
Benchmarks used:
HumanEval · 30%
MBPP · 20%
SWE-bench · 50%
| # | Model | Score | Organization |
|---|---|---|---|
| 1 | | 92.0 | Anthropic |
| 2 | | 92.0 | Mistral AI |
| 3 | | 91.7 | Alibaba (Qwen Team) |
| 4 | | 90.2 | OpenAI |
| 5 | | 90.0 | DeepSeek |
| 6 | | 89.0 | Meta AI |
| 7 | | 88.4 | xAI |
| 8 | | 88.4 | Meta AI |
| 9 | | 87.2 | OpenAI |
| 10 | | 86.6 | Alibaba (Qwen Team) |
| 11 | | 83.0 | Anthropic |
| 12 | | 83.0 | DeepSeek |
| 13 | | 82.6 | DeepSeek |
| 14 | | 82.6 | Anthropic |
| 15 | | 81.8 | OpenAI |
| 16 | | 81.0 | Google DeepMind |
| 17 | | 80.5 | Meta AI |
| 18 | | 80.0 | DeepSeek |
| 19 | | 79.8 | Anthropic |
| 20 | | 76.0 | Mistral AI |
| 21 | | 72.7 | Google DeepMind |
| 22 | | 72.6 | Meta AI |
| 23 | | 70.7 | Cohere |
| 24 | Kimi K2 (open) | 65.8 | Moonshot AI |
Showing top 24 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
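
As a rough illustration of how a weighted average with these weights could be computed, here is a minimal Python sketch. The function name `combined_score`, the dictionary keys, and the input numbers are hypothetical. The page does not say how a missing benchmark is handled, so renormalizing the remaining weights is an assumption, not the documented method.

```python
# Weights as listed under "Benchmarks used".
WEIGHTS = {"humaneval": 0.30, "mbpp": 0.20, "swe_bench": 0.50}


def combined_score(scores):
    """Weighted average of the available benchmark scores (each 0-100).

    Assumption: when a benchmark has no published score (None or absent),
    the remaining weights are renormalized so they sum to 1. The page
    does not document how such gaps are actually treated.
    """
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        raise ValueError("need a score on at least one benchmark")
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Made-up inputs for illustration; these are not values from the table.
full = combined_score({"humaneval": 90.0, "mbpp": 80.0, "swe_bench": 60.0})
partial = combined_score({"humaneval": 90.0, "mbpp": None, "swe_bench": None})
print(round(full, 1), round(partial, 1))
```

With all three scores present, the result is 0.30 × 90 + 0.20 × 80 + 0.50 × 60 = 73. With only HumanEval available, the renormalization assumption returns that score unchanged, which would explain how a model with a single strong benchmark could rank highly.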