MMLU-Pro, GPQA Diamond, and MATH.
Best AI models for reasoning.
A composite of MMLU-Pro (broad knowledge on harder questions), GPQA Diamond (graduate-level science), and MATH (competition mathematics): the three benchmarks where reasoning matters most.
Benchmarks used:
MMLU-Pro · 40%
GPQA Diamond · 40%
MATH · 20%
| # | Model | Score | From |
|---|---|---|---|
| 1 | | 94.3 | DeepSeek |
| 2 | | 93.9 | DeepSeek |
| 3 | | 92.8 | DeepSeek |
| 4 | | 84.5 | xAI |
| 5 | | 83.9 | DeepSeek |
| 6 | | 82.6 | OpenAI |
| 7 | | 81.7 | DeepSeek |
| 8 | Kimi K2 (open) | 78.5 | Moonshot AI |
| 9 | | 78.2 | DeepSeek |
| 10 | | 77.9 | Anthropic |
| 11 | | 76.8 | Google DeepMind |
| 12 | | 73.0 | Mistral AI |
| 13 | | 72.0 | DeepSeek |
| 14 | | 71.2 | Anthropic |
| 15 | | 64.7 | Alibaba (Qwen Team) |
| 16 | | 64.5 | Meta AI |
| 17 | | 63.2 | Meta AI |
| 18 | | 61.8 | Google DeepMind |
| 19 | | 61.3 | OpenAI |
| 20 | | 42.0 | Anthropic |
| 21 | | 41.8 | Mistral AI |
| 22 | | 32.8 | Meta AI |
Showing top 22 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
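
As a rough illustration of how such a weighted average could be computed, here is a minimal Python sketch using the 40/40/20 weights listed under "Benchmarks used". The names `WEIGHTS` and `composite_score` are hypothetical, and the handling of models that lack data on some benchmarks (dropping the missing benchmark and renormalizing the remaining weights) is an assumption; the page does not say how missing scores are treated.

```python
from typing import Mapping, Optional

# Weights as listed under "Benchmarks used".
WEIGHTS = {"MMLU-Pro": 0.40, "GPQA Diamond": 0.40, "MATH": 0.20}


def composite_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted average (0-100 scale) over benchmarks with a published score.

    ASSUMPTION, not confirmed by the source: benchmarks without a score are
    skipped and the remaining weights are renormalized to sum to 1.
    """
    # Keep only known benchmarks that actually have a score.
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None  # no published data on any of the three benchmarks

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Made-up scores for illustration only; they do not come from the table.
full = composite_score({"MMLU-Pro": 80.0, "GPQA Diamond": 70.0, "MATH": 90.0})
partial = composite_score({"MMLU-Pro": 80.0, "GPQA Diamond": None, "MATH": 90.0})
print(round(full, 1), round(partial, 1))
```

With all three scores present, the result is the plain 40/40/20 weighted mean. With GPQA Diamond missing, the sketch divides by the remaining weight (0.60), so a model is not penalized for an unpublished benchmark. A leaderboard could equally treat a missing score as zero, which would rank such models lower.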