Run LLMs (inference / serving).
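LLM inference is typically memory-bandwidth-bound: VRAM size caps the model size plus context window you can hold, and quantization (int4/int8/fp8) lets you fit larger models on cheaper hardware. A 70B model at int4 needs about 35 GB for weights alone, so it exceeds a single 24GB RTX 4090 unless you offload layers to system RAM, split the model across two cards, or quantize below 4 bits.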
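The sizing arithmetic behind that paragraph can be sketched in a few lines. This is a rough estimate under stated assumptions, not a benchmark. The helper names (`weight_gb`, `fits`, `decode_ceiling_tok_s`), the 2 GB overhead allowance, and the 1000 GB/s bandwidth figure in the usage note are all illustrative and do not come from the page or from any vendor spec.

```python
# Bytes per parameter for common weight formats (fp8 and int8 are both 1 byte).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[fmt]


def fits(params_billion: float, fmt: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gb is a hypothetical allowance for KV cache and runtime
    buffers; real usage grows with context length and batch size.
    """
    return weight_gb(params_billion, fmt) + overhead_gb <= vram_gb


def decode_ceiling_tok_s(params_billion: float, fmt: str,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model,
    assuming every generated token reads all weights from VRAM once."""
    return bandwidth_gb_s / weight_gb(params_billion, fmt)
```

For example, `weight_gb(70, "int4")` gives 35.0, so `fits(70, "int4", 24)` is false while `fits(70, "int4", 48)` is true. With a placeholder bandwidth of 1000 GB/s, `decode_ceiling_tok_s(70, "int4", 1000.0)` comes to roughly 28.6 tokens per second. Real throughput will be lower, and mixture-of-experts models read only their active parameters per token, so the dense-model bound does not apply to them directly.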
Top GPUs for this workload.
Ranked by suitability — higher fitness scores mean the card handles this workload more comfortably.
| GPU | Tier | VRAM | Fit |
|---|---|---|---|
| | datacenter | 192GB | 100 |
| | datacenter | 141GB | 95 |
| | datacenter | 80GB | 90 |
| | consumer | 32GB | 80 |
| | workstation | 48GB | 80 |
| | datacenter | 48GB | 80 |
| | datacenter | 80GB | 80 |
| | consumer | 24GB | 75 |
| | workstation | 48GB | 75 |
| | workstation | 32GB | 65 |
| | datacenter | 24GB | 65 |
| | consumer | 16GB | 60 |
Top models for this workload.
GPT-5
OpenAI's frontier multimodal reasoning model.
Claude Opus 4.7
Frontier reasoning and long-form coding from Anthropic.
Qwen 3 235B
Alibaba's frontier MoE — 235B total / 22B active.
Claude 3.5 Sonnet
Anthropic's 3.5 generation — still in active production.
DeepSeek R1 Distill Llama 70B
70B Llama distilled from DeepSeek R1's reasoning traces.
DeepSeek R1
DeepSeek's reasoning model — RL-trained, frontier-class, MIT-licensed.
DeepSeek V3
DeepSeek's flagship MoE — 671B total, 37B active, frontier-class.
Gemini 2.5 Pro
Google's frontier reasoning model with native 1M-token context.
Claude Sonnet 4.6
Best price-performance from Anthropic. Default for production agents.
Hermes 3 70B
Nous Research's flagship Llama fine-tune — agent-friendly.
GPT-OSS 120B
OpenAI's first open-weight LLM in years — Apache-licensed MoE.
Qwen 3 32B
Dense 32B Qwen 3.