MATH and GSM8K.
Best AI models for math.
MATH (competition-level problems) is weighted most heavily, with GSM8K (grade-school word problems) serving as the floor. Models that score well on both handle algebra, calculus, and chain-of-thought arithmetic.
Benchmarks used:
MATH · 70%
GSM8K · 30%
| # | Model | Score | From |
|---|---|---|---|
| 1 | | 97.3 | DeepSeek |
| 2 | | 96.0 | OpenAI |
| 3 | | 94.5 | DeepSeek |
| 4 | | 94.3 | DeepSeek |
| 5 | | 93.9 | DeepSeek |
| 6 | | 93.3 | xAI |
| 7 | | 92.8 | DeepSeek |
| 8 | | 92.0 | Google DeepMind |
| 9 | | 90.2 | DeepSeek |
| 10 | | 89.0 | Google DeepMind |
| 11 | | 87.5 | Anthropic |
| 12 | | 83.9 | DeepSeek |
| 13 | | 83.1 | Alibaba (Qwen Team) |
| 14 | | 82.0 | Anthropic |
| 15 | | 77.0 | Meta AI |
| 16 | | 76.6 | OpenAI |
| 17 | | 73.8 | Meta AI |
| 18 | | 73.0 | Mistral AI |
| 19 | | 41.8 | Mistral AI |
Showing top 19 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
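As a rough illustration of how a 70/30 weighted average like this could be computed, here is a minimal Python sketch. The function name `weighted_score` and the input numbers are hypothetical. The page does not say how a model with data on only one benchmark is scored, so the renormalization over available benchmarks shown here is an assumption, not the site's confirmed method.

```python
from typing import Mapping, Optional

# Weights taken from the "Benchmarks used" list above.
WEIGHTS = {"MATH": 0.70, "GSM8K": 0.30}


def weighted_score(results: Mapping[str, Optional[float]]) -> Optional[float]:
    """Combine per-benchmark scores (0-100) into one weighted average.

    ASSUMPTION: a benchmark with no published score is dropped and the
    remaining weights are renormalized. The page does not state how
    missing benchmarks are actually handled.
    """
    # Keep only known benchmarks that have a published score.
    available = {
        name: score
        for name, score in results.items()
        if name in WEIGHTS and score is not None
    }
    if not available:
        return None  # no published data on either benchmark

    total_weight = sum(WEIGHTS[name] for name in available)
    combined = sum(WEIGHTS[name] * score for name, score in available.items())
    return round(combined / total_weight, 1)


# Made-up inputs for illustration; these are not values from the table.
print(weighted_score({"MATH": 90.0, "GSM8K": 96.0}))  # 0.7*90 + 0.3*96
print(weighted_score({"MATH": 80.0, "GSM8K": None}))  # MATH only
```

Under this assumption, a model with only a MATH score is ranked on that score alone, which would tend to flatter models that skip the benchmark they are weaker on. A stricter alternative would be to require both scores before ranking.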