GLM-4.5.
Zhipu's frontier open-weight MoE — 355B total, 32B active. Strong agentic + reasoning marks for an open model.
1× AMD MI325.
Most-aggressive quantisation we have a working recommendation for. Lower precision = less VRAM = cheaper hardware, at a small accuracy cost.
Cheapest hosted endpoints.
Speed across providers.
Tokens/sec and time-to-first-token measured against the same prompt template on each provider's API.
| Provider | Tokens/sec | TTFT | Total |
|---|---|---|---|
| OpenRouter | 47.7 | 7589 ms | 9255 ms |
Smaller models distilled from GLM-4.5.
Lightweight student models trained to mimic GLM-4.5's outputs.
Variants in the GLM family.
Smaller, cheaper sibling of GLM-4.5. 106B total, 12B active.
Zhipu's GLM 5.1 series — successor to GLM-5 on z.ai's API.
Zhipu's GLM 5 generation — closed flagship between GLM-4.7 and GLM-5.1.
Faster, cheaper sibling of GLM-5 on z.ai.
Mid-generation GLM 4.7 released between GLM-4.6 and GLM-5.
Incremental upgrade on GLM-4.5 — improved reasoning, same context window.
Lowest-latency, lowest-cost variant of GLM-4.5 on z.ai.
Frequently asked.
How do I run GLM-4.5?
Where can I access GLM-4.5?
How much does it cost to run GLM-4.5?
Is GLM-4.5 open-source or proprietary?
Cheapest hardware per quantisation.
Each row is one quantisation tier (the same weights compressed differently). Lower precision → lower VRAM → cheaper hardware, at the cost of small accuracy loss. $/hr refreshed hourly from each provider's API.
| Quantisation | Cheapest GPU config | Total VRAM | Live $/hr | tokens/sec | |
|---|---|---|---|---|---|
|
FP16
FP16 — half precision (default)
|
1024 GB | — | — | Compare → | |
|
FP8
FP8 — 8-bit float (Hopper / Blackwell)
|
512 GB | — | — | Compare → | |
|
INT4
INT4 — 4-bit integer (~4× VRAM saving)
|
256 GB | — | — | Compare → |
What it costs per month across providers.
Estimate your monthly bill for GLM-4.5 across every host that publishes per-token pricing. Slide your token volumes; the chart + table re-rank cheapest-first.
Cheapest provider on the left.
Total monthly cost — input + output tokens combined.
Bill breakdown.
About GLM-4.5.
GLM-4.5 is Zhipu AI's frontier open-weight model, released in July 2025. It uses a Mixture-of-Experts architecture (355B parameters total, 32B active per token) and was trained with a focus on agentic tasks — tool use, multi-step reasoning, and long-form code generation. On launch it placed in the open-model top tier of the LMSYS Arena leaderboard. Available on Hugging Face under MIT licence; the smaller GLM-4.5-Air variant trades parameter count for cheaper inference. Zhipu offers their own API at bigmodel.cn and a consumer chat product at z.ai. 128K context, native function calling, JSON-mode supported.
How it's built.
How much it can remember.
What it can do.
Every place this model is hosted.
Self-hosted on rented GPU cluster
self hostedMulti-GPU MoE deployment — 355B total, 32B active.
Self-hosted on rented GPU cluster
self hostedMulti-GPU MoE deployment — 355B total, 32B active.
z.ai
chat uiZhipu's consumer chat surface — free with per-message limits.
Zhipu BigModel
api directZhipu BigModel
api directTogether AI
hosted inferenceTogether AI
hosted inferenceSiliconFlow
hosted inferenceSiliconFlow
hosted inferenceOpenRouter
api aggregatorz.ai
hosted inferenceZhipu's international API surface. Same model as BigModel, English docs.