Qwen 2.5 72B.
Alibaba's flagship open-weight LLM — 72B dense.
1× Nvidia RTX PRO 5000 Blackwell.
Most-aggressive quantisation we have a working recommendation for. Lower precision = less VRAM = cheaper hardware, at a small accuracy cost.
Cheapest hosted endpoints.
What it's best at.
Speed across providers.
Tokens/sec and time-to-first-token measured against the same prompt template on each provider's API.
| Provider | Tokens/sec | TTFT | Total |
|---|---|---|---|
| OpenRouter | 25.9 | 768 ms | 4940 ms |
Smaller models distilled from Qwen 2.5 72B.
Lightweight student models trained to mimic Qwen 2.5 72B's outputs.
7B Qwen 2.5 — most popular Qwen variant on Ollama.
14B Qwen 2.5 — sweet spot for single-GPU local hosting.
32B Qwen 2.5 — laptop-class workhorse.
3B Qwen 2.5 — laptop / edge target.
Variants in the Qwen family.
7B Qwen 2.5 — most popular Qwen variant on Ollama.
14B Qwen 2.5 — sweet spot for single-GPU local hosting.
32B Qwen 2.5 — laptop-class workhorse.
3B Qwen 2.5 — laptop / edge target.
Alibaba's open-weight coding model — best in class for 32B.
Workloads.
Frequently asked.
How do I run Qwen 2.5 72B?
Where can I access Qwen 2.5 72B?
How much does it cost to run Qwen 2.5 72B?
Is Qwen 2.5 72B open-source or proprietary?
Cheapest hardware per quantisation.
Each row is one quantisation tier (the same weights compressed differently). Lower precision → lower VRAM → cheaper hardware, at the cost of small accuracy loss. $/hr refreshed hourly from each provider's API.
| Quantisation | Cheapest GPU config | Total VRAM | Live $/hr | tokens/sec | |
|---|---|---|---|---|---|
|
FP16
FP16 — half precision (default)
|
192 GB | — | — | Compare → | |
|
FP8
FP8 — 8-bit float (Hopper / Blackwell)
|
94 GB | — | — | Compare → | |
|
INT8
INT8 — 8-bit integer
|
96 GB | $0.61/hr | — | Compare → | |
|
INT4
INT4 — 4-bit integer (~4× VRAM saving)
|
48 GB | — | — | Compare → |
What it costs per month across providers.
Estimate your monthly bill for Qwen 2.5 72B across every host that publishes per-token pricing. Slide your token volumes; the chart + table re-rank cheapest-first.
Cheapest provider on the left.
Total monthly cost — input + output tokens combined.
Bill breakdown.
Rent the GPU instead of paying per token.
For an open-weights model like Qwen 2.5 72B, you can rent a GPU and serve inference yourself. The math: cheapest GPU rental × 730 hours/month + your electricity rate × power draw.
Assumes the GPU runs 24/7 at ~85% utilisation. If your traffic is bursty, you'll pay less for the API and probably more for the GPU (idle hours still cost rental). The breakeven analysis lives on the Self-host vs API breakeven tool.
What it's best at.
Scores normalised against benchmark ceilings (100 = perfect). Coloured by tier — coral 80+ frontier, lavender 65+ strong, sage 50+ solid, slate below.
Published scores.
| Benchmark | Score | Source |
|---|---|---|
| GPQA | 49.0 | official ↗ |
| MATH | 83.1 | official ↗ |
| MMLU | 86.1 | official ↗ |
| MMLU-Pro | 71.1 | official ↗ |
| HumanEval | 86.6 | official ↗ |
Independent rankings.
About Qwen 2.5 72B.
Qwen 2.5 72B is Alibaba's flagship open-weight model in the 2.5 generation (released September 2024). Strong on multilingual tasks (29 languages with deep optimisation for Chinese, Japanese, Korean) and competitive with Llama 3.1 70B on most English benchmarks. 128K context. Available on Hugging Face, ModelScope, and via Alibaba Cloud's DashScope API. License permits commercial use with attribution; restrictions on military and high-risk applications.