Kimi K2.
Moonshot's frontier open-weight MoE — 1T total, 32B active.
4× AMD MI300.
The most aggressive quantisation for which we have a working recommendation. Lower precision means less VRAM and cheaper hardware, at a small accuracy cost.
Cheapest hosted endpoints.
Speed across providers.
Tokens/sec and time-to-first-token measured against the same prompt template on each provider's API.
| Provider | Tokens/sec | TTFT | Total |
|---|---|---|---|
| OpenRouter | 14.5 | 2170 ms | 6284 ms |
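As a rough consistency check on a row like the one above, the three columns can be related under one assumption the page does not state: that Total is TTFT plus generation time, with tokens/sec measured over the generation phase only. The function name is illustrative.

```python
def implied_output_tokens(tokens_per_sec: float, ttft_ms: float, total_ms: float) -> float:
    """Estimate how many tokens were generated in one benchmark run.

    ASSUMPTION (not stated on the page): total = TTFT + generation time,
    and tokens/sec covers only the generation phase.
    """
    generation_seconds = (total_ms - ttft_ms) / 1000.0
    return tokens_per_sec * generation_seconds


# Values copied from the OpenRouter row of the table.
tokens = implied_output_tokens(tokens_per_sec=14.5, ttft_ms=2170, total_ms=6284)
print(f"Implied output length: about {tokens:.1f} tokens")
```

Under that assumption the row corresponds to a response of roughly 60 tokens, so the Total column mostly reflects a short completion rather than sustained throughput.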
Smaller models distilled from Kimi K2.
Lightweight student models trained to mimic Kimi K2's outputs.
Variants in the Kimi family.
Moonshot's open-weight reasoning variant — extended chain-of-thought training...
Multimodal agentic variant — adds a vision encoder to the K2 backbone.
Long-horizon coding + autonomous-execution upgrade over K2.5.
Frequently asked.
How do I run Kimi K2?
Where can I access Kimi K2?
How much does it cost to run Kimi K2?
Is Kimi K2 open-source or proprietary?
Cheapest hardware per quantisation.
Each row is one quantisation tier (the same weights compressed differently). Lower precision → lower VRAM → cheaper hardware, at the cost of a small accuracy loss. $/hr is refreshed hourly from each provider's API.
| Quantisation | Cheapest GPU config | Total VRAM | Live $/hr | Tokens/sec |
|---|---|---|---|---|
| FP8 (8-bit float; Hopper / Blackwell) | | 564 GB | $1.86/hr | n/a |
| INT4 (4-bit integer; ~4× VRAM saving) | | 768 GB | n/a | n/a |
What it costs per month across providers.
Estimate your monthly bill for Kimi K2 across every host that publishes per-token pricing. Adjust your token volumes with the sliders; the chart and table re-rank cheapest-first.
Cheapest provider on the left.
Total monthly cost — input + output tokens combined.
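The estimate behind the chart can be sketched as follows. This is a minimal illustration, not the page's actual calculator: the provider names and per-million-token prices are hypothetical placeholders, and real pricing should be taken from each host.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    provider: str
    input_per_mtok: float   # USD per 1M input tokens (hypothetical values below)
    output_per_mtok: float  # USD per 1M output tokens (hypothetical values below)


def monthly_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total monthly bill: input and output tokens combined."""
    return (input_tokens / 1e6) * p.input_per_mtok + (output_tokens / 1e6) * p.output_per_mtok


def rank_cheapest_first(prices, input_tokens, output_tokens):
    """Return (provider, cost) pairs sorted by ascending monthly cost."""
    costs = [(p.provider, monthly_cost(p, input_tokens, output_tokens)) for p in prices]
    return sorted(costs, key=lambda pair: pair[1])


# HYPOTHETICAL prices, for illustration only.
HYPOTHETICAL_PRICES = [
    Pricing("provider-a", input_per_mtok=0.50, output_per_mtok=2.50),
    Pricing("provider-b", input_per_mtok=0.60, output_per_mtok=1.50),
]

balanced = rank_cheapest_first(HYPOTHETICAL_PRICES, 100_000_000, 20_000_000)
input_heavy = rank_cheapest_first(HYPOTHETICAL_PRICES, 500_000_000, 10_000_000)
print(balanced)
print(input_heavy)
```

With these placeholder prices the ranking flips between the two workloads, which is why the order depends on your own input/output mix.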
Bill breakdown.
Rent the GPU instead of paying per token.
For an open-weights model like Kimi K2, you can rent a GPU and serve inference yourself. The math: (cheapest GPU rental rate × 730 hours/month) + (your electricity rate × power draw × the same 730 hours).
This assumes the GPU runs 24/7 at ~85% utilisation. If your traffic is bursty, you'll pay less for the API and probably more for the GPU, because idle hours are still billed at the rental rate. The breakeven analysis lives on the Self-host vs API breakeven tool.
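A minimal sketch of that arithmetic, using the $1.86/hr figure from the FP8 row above. Everything else is an assumption for illustration: the function names, the blended API price, and the aggregate throughput are hypothetical, and the electricity term defaults to zero because many rental prices already include power.

```python
HOURS_PER_MONTH = 730


def self_host_monthly_cost(rental_per_hr: float,
                           power_kw: float = 0.0,
                           electricity_per_kwh: float = 0.0) -> float:
    """Rental for 730 hours plus electricity for the same hours.

    Multiplying power draw by hours is our reading of the formula;
    leave the electricity inputs at 0 if the rental price includes power.
    """
    rental = rental_per_hr * HOURS_PER_MONTH
    electricity = power_kw * HOURS_PER_MONTH * electricity_per_kwh
    return rental + electricity


def breakeven_tokens_per_month(self_host_cost: float, api_price_per_mtok: float) -> float:
    """Monthly token volume at which the API bill equals the self-host bill."""
    return self_host_cost / api_price_per_mtok * 1e6


def monthly_token_capacity(tokens_per_sec: float, utilisation: float = 0.85) -> float:
    """Tokens one deployment can serve in a month at the given utilisation."""
    return tokens_per_sec * 3600 * HOURS_PER_MONTH * utilisation


cost = self_host_monthly_cost(rental_per_hr=1.86)            # rate from the FP8 row
breakeven = breakeven_tokens_per_month(cost, api_price_per_mtok=2.00)  # HYPOTHETICAL API price
capacity = monthly_token_capacity(tokens_per_sec=1000)        # HYPOTHETICAL throughput
print(f"Self-host: ${cost:,.2f}/month")
print(f"Breakeven: {breakeven / 1e6:,.1f}M tokens/month")
print(f"Capacity at 85% utilisation: {capacity / 1e6:,.1f}M tokens/month")
```

Self-hosting only pays off when the breakeven volume sits below both your real traffic and the deployment's capacity.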
What it's best at.
Scores normalised against benchmark ceilings (100 = perfect). Coloured by tier — coral 80+ frontier, lavender 65+ strong, sage 50+ solid, slate below.
Published scores.
| Benchmark | Score | Source |
|---|---|---|
| MMLU-Pro | 78.5 | official ↗ |
| SWE-bench | 65.8 | official ↗ |
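The colour tiers described above map onto the published scores roughly as in this sketch. It assumes the thresholds are inclusive ("80+" meaning 80 or higher); the function name is illustrative.

```python
def score_tier(score: float) -> str:
    """Map a normalised score (100 = perfect) to the page's colour tier.

    ASSUMPTION: thresholds are inclusive lower bounds.
    """
    if score >= 80:
        return "coral (frontier)"
    if score >= 65:
        return "lavender (strong)"
    if score >= 50:
        return "sage (solid)"
    return "slate"


# Scores copied from the table of published benchmarks.
published = {"MMLU-Pro": 78.5, "SWE-bench": 65.8}
tiers = {name: score_tier(value) for name, value in published.items()}
for name, tier in tiers.items():
    print(f"{name}: {tier}")
```

Both published scores fall in the lavender (strong) band under these thresholds.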
About Kimi K2.
Kimi K2 is Moonshot AI's frontier open-weight model — 1T parameters total (Mixture-of-Experts) with 32B activated per token. Trained primarily for agentic tasks; tops several SWE-bench-style coding benchmarks at the open-weight tier. The K2 release made waves in mid-2025 by matching closed-frontier models on coding and tool use while remaining fully open. Available on Hugging Face under a modified MIT license. Cost-competitive with Claude Sonnet on Moonshot's own API. 256K context.
How it's built.
How much it can remember.
What it can do.
Every place this model is hosted.
Kimi.com
Chat UI: free for the consumer chat surface; per-message limits apply.
Self-hosted on rented GPU cluster
Self-hosted: multi-GPU MoE deployment.
Moonshot AI Platform
Direct API: pricing inferred from OpenRouter (within ~5-20% of direct API rates).