MMLU-Pro, GPQA Diamond, and MATH.
Best AI models for reasoning.
A composite of MMLU-Pro (broad knowledge with harder questions), GPQA Diamond (graduate-level science), and MATH (competition mathematics): the three benchmarks where reasoning is decisive.
Benchmarks used:
MMLU-Pro · 40%
GPQA Diamond · 40%
MATH · 20%
| # | Model | Score | By |
|---|---|---|---|
| 1 | | 94.3 | DeepSeek |
| 2 | | 93.9 | DeepSeek |
| 3 | | 92.8 | DeepSeek |
| 4 | | 84.5 | xAI |
| 5 | | 83.9 | DeepSeek |
| 6 | | 82.6 | OpenAI |
| 7 | | 81.7 | DeepSeek |
| 8 | Kimi K2 (open) | 78.5 | Moonshot AI |
| 9 | | 78.2 | DeepSeek |
| 10 | | 77.9 | Anthropic |
| 11 | | 76.8 | Google DeepMind |
| 12 | | 73.0 | Mistral AI |
| 13 | | 72.0 | DeepSeek |
| 14 | | 71.2 | Anthropic |
| 15 | | 64.7 | Alibaba (Qwen Team) |
| 16 | | 64.5 | Meta AI |
| 17 | | 63.2 | Meta AI |
| 18 | | 61.8 | Google DeepMind |
| 19 | | 61.3 | OpenAI |
| 20 | | 42.0 | Anthropic |
| 21 | | 41.8 | Mistral AI |
| 22 | | 32.8 | Meta AI |
Showing top 22 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
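As a rough illustration of how such a weighted average could be computed, the sketch below applies the 40/40/20 weights listed under "Benchmarks used". The page does not say how a model with scores on only some benchmarks is handled; the sketch assumes the missing benchmarks are dropped and the remaining weights renormalized, which may differ from the actual method. The function name `composite_score` and the input layout are hypothetical.

```python
from typing import Mapping, Optional

# Weights in percent, as listed under "Benchmarks used".
WEIGHTS = {"MMLU-Pro": 40, "GPQA Diamond": 40, "MATH": 20}


def composite_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted average (0-100) over the benchmarks that have a score.

    Assumption (not stated on the page): benchmarks without a published
    score are skipped and the remaining weights are renormalized.
    """
    # Keep only known benchmarks that actually have a score.
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None  # no published data on any of the three benchmarks

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Made-up example scores, not taken from the leaderboard.
full = composite_score({"MMLU-Pro": 80.0, "GPQA Diamond": 70.0, "MATH": 90.0})
partial = composite_score({"MMLU-Pro": 80.0, "GPQA Diamond": 70.0, "MATH": None})
```

With these made-up inputs, `full` is (40·80 + 40·70 + 20·90) / 100 = 78.0, and `partial` is (40·80 + 40·70) / 80 = 75.0 under the renormalization assumption.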