Use case

Run LLMs (inference / serving).

Inference is bandwidth-bound: VRAM size sets the maximum model + context window, but you can mix quantization (int4/int8/fp8) to fit larger models on cheaper hardware. RTX 4090 runs 70B models at int4 quality.

≥ 12GB VRAM
Best GPUs

Top GPUs for this workload.

Ranked by suitability — higher fitness scores mean the card handles this workload more comfortably.

GPU Tier VRAM Fit
datacenter 192GB 100 Compare →
datacenter 141GB 95 Compare →
datacenter 80GB 90 Compare →
consumer 32GB 80 Compare →
workstation 48GB 80 Compare →
datacenter 48GB 80 Compare →
datacenter 80GB 80 Compare →
consumer 24GB 75 Compare →
workstation 48GB 75 Compare →
workstation 32GB 65 Compare →
datacenter 24GB 65 Compare →
consumer 16GB 60 Compare →
Best AI models

Top models for this workload.