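Best AI Models for Reasoning
A composite of MMLU-Pro (harder broad knowledge), GPQA Diamond (graduate-level science), and MATH (competition math): the three benchmarks that matter most for reasoning skill.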
Benchmarks used:
MMLU-Pro · 40%
GPQA · 40%
MATH · 20%
| # | Model | Score | Provider |
|---|---|---|---|
| 1 | | 94.3 | DeepSeek |
| 2 | | 93.9 | DeepSeek |
| 3 | | 92.8 | DeepSeek |
| 4 | | 84.5 | xAI |
| 5 | | 83.9 | DeepSeek |
| 6 | | 82.6 | OpenAI |
| 7 | | 81.7 | DeepSeek |
| 8 | Kimi K2 (open) | 78.5 | Moonshot AI |
| 9 | | 78.2 | DeepSeek |
| 10 | | 77.9 | Anthropic |
| 11 | | 76.8 | Google DeepMind |
| 12 | | 73.0 | Mistral AI |
| 13 | | 72.0 | DeepSeek |
| 14 | | 71.2 | Anthropic |
| 15 | | 64.7 | Alibaba (Qwen Team) |
| 16 | | 64.5 | Meta AI |
| 17 | | 63.2 | Meta AI |
| 18 | | 61.8 | Google DeepMind |
| 19 | | 61.3 | OpenAI |
| 20 | | 42.0 | Anthropic |
| 21 | | 41.8 | Mistral AI |
| 22 | | 32.8 | Meta AI |
Showing top 22 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
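
As a rough illustration of how a 40/40/20 weighted average on a 0–100 scale could be computed, here is a minimal Python sketch. The page does not say how a model with scores on only some benchmarks is handled; the sketch assumes the weights of the missing benchmarks are dropped and the remaining weights are renormalized. The function name `composite_score` and the example scores are hypothetical and are not taken from the leaderboard.

```python
# Weights as listed under "Benchmarks used".
WEIGHTS = {"MMLU-Pro": 0.40, "GPQA": 0.40, "MATH": 0.20}


def composite_score(scores):
    """Weighted average over the benchmarks a model has scores for.

    Assumption: missing benchmarks are skipped and the remaining weights
    are renormalized. The page does not state its actual rule.
    """
    # Keep only known benchmarks that actually have a score.
    available = {
        name: value
        for name, value in scores.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        raise ValueError("no benchmark scores available")
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Hypothetical scores, for illustration only.
full = composite_score({"MMLU-Pro": 80.0, "GPQA": 70.0, "MATH": 90.0})
partial = composite_score({"MMLU-Pro": 80.0, "GPQA": 70.0})  # no MATH score
print(round(full, 1), round(partial, 1))
```

Under this assumption, a model missing one benchmark is scored only on the benchmarks it has, so scores of models with different benchmark coverage may not be strictly comparable.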