RTX 4090 vs H100: Cost per FLOP for AI training.
Live $/hr rental data + published TFLOPS specs to compute the real cost per teraflop on each card. With 8x sweep and quantization caveats.
The headline number lies
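If you compare raw FP16 throughput, the H100 blows the RTX 4090 out of the water: 989 TFLOPS sparse / 495 TFLOPS dense vs the 4090's 83 TFLOPS dense, a gap of roughly 6×. (NVIDIA's datasheet FP16 tensor-core ratings are about double these figures for both the H100 SXM and the 4090, so the ratio, which is what matters here, is unchanged.) So why does anyone train on 4090s?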
Per dollar, the picture flips. The 4090 is a consumer card with a $1,599 MSRP, and it typically rents on peer-to-peer marketplaces such as Vast.ai and Clore.ai for $0.30–$0.50/hr, with the cheapest live listings in the tables below dipping to $0.12/hr. The H100 is a roughly $30K datacenter card that typically rents for $2–$4/hr, although the lowest live listings below start at $0.80–$1.25/hr.
Doing the actual math
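Cost per teraflop-hour, FP16 dense, at the lowest live price for each card:
- RTX 4090: $0.115/hr ÷ 83 TFLOPS = $0.00139 per TFLOP-hour
- H100 (FP16 dense, no sparsity): $0.80/hr ÷ 495 TFLOPS = $0.00162 per TFLOP-hour (this io.net listing shows 0 offers in the table below; at Fluence's $1.05/hr the figure is $0.00212)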
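So in flat per-FLOP terms the two cards are close: at these floor prices the 4090 is about 14% cheaper, and at typical prices ($0.30/hr against $2/hr) it is about 11% cheaper. Pair a typical 4090 price with a floor-priced H100, though, and the H100 wins. Either way, raw FLOPS isn't what you're paying for. You're paying for the ability to actually finish a training run.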
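The division above is easy to rerun as prices move. The following is a minimal sketch, assuming the TFLOPS figures quoted in this article and two illustrative price points per card (the floor prices from the calculation above and the low end of the typical ranges mentioned earlier); the function and variable names are hypothetical.

```python
def cost_per_tflop_hour(price_per_hr: float, dense_tflops: float) -> float:
    """Rental price divided by dense throughput, in $ per TFLOP-hour."""
    if dense_tflops <= 0:
        raise ValueError("dense_tflops must be positive")
    return price_per_hr / dense_tflops


# Dense FP16 figures as quoted in this article (assumed inputs, not measurements).
RTX_4090_TFLOPS = 83.0
H100_TFLOPS = 495.0

# Floor prices: the cheapest listings used in the calculation above.
floor = {
    "RTX 4090": cost_per_tflop_hour(0.115, RTX_4090_TFLOPS),
    "H100": cost_per_tflop_hour(0.80, H100_TFLOPS),
}
# Typical prices: low end of the ranges quoted earlier in the article.
typical = {
    "RTX 4090": cost_per_tflop_hour(0.30, RTX_4090_TFLOPS),
    "H100": cost_per_tflop_hour(2.00, H100_TFLOPS),
}

for label, costs in (("floor", floor), ("typical", typical)):
    ratio = costs["RTX 4090"] / costs["H100"]
    print(f"{label}: 4090 ${costs['RTX 4090']:.5f}, "
          f"H100 ${costs['H100']:.5f}, 4090/H100 ratio {ratio:.2f}")
```

A ratio below 1.0 means the 4090 is cheaper per TFLOP-hour at that pair of prices. The result is sensitive to which listings you pair, so it is worth rerunning with the prices you can actually get.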
What the 4090 can't do
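- VRAM ceiling: FP16 weights take 2 bytes per parameter, so the 4090's 24 GB tops out near 12B parameters for weights alone (a 13B model needs about 26 GB), or near 48B at INT4, and less in practice once activations and KV cache are counted. The H100's 80 GB tops out near 40B at FP16; a 70B model needs about 140 GB at FP16, so it fits on one H100 only when quantized to INT8 or INT4.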
- No NVLink: the 4090 lacks the NVLink connector that lets 8× datacenter cards behave like one big GPU. Multi-4090 setups communicate over PCIe, which limits all-reduce in distributed training to a small fraction of NVLink bandwidth.
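- Sparsity: the H100's "989 TFLOPS" number assumes 2:4 structured sparsity, which ordinary dense training does not exploit. The 4090's tensor cores support the same 2:4 scheme, so sparsity cannot close the gap; compare dense figures on both cards.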
- FP8: the H100 natively supports FP8 (1979 TFLOPS sparse). The 4090's Ada-generation tensor cores also run FP8, but at a fraction of the H100's rate.
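A rough way to sanity-check VRAM claims like the one in the list above is to count weight bytes only. The sketch below assumes 2 bytes per parameter at FP16, 1 at INT8 and 0.5 at INT4, and it deliberately ignores activations, KV cache, gradients and optimizer state, so the results are lower bounds on what a real run needs. The helper names are hypothetical.

```python
# Bytes per parameter for each weight precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_weights(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in VRAM; real workloads need extra headroom."""
    return weights_gb(params_billion, precision) <= vram_gb


def max_params_billion(vram_gb: float, precision: str) -> float:
    """Largest model (billions of parameters) whose weights alone fit in VRAM."""
    return vram_gb / BYTES_PER_PARAM[precision]


for card, vram in (("RTX 4090", 24), ("H100", 80)):
    for size in (7, 13, 70):
        for precision in ("fp16", "int8", "int4"):
            need = weights_gb(size, precision)
            verdict = "fits" if fits_weights(size, precision, vram) else "too big"
            print(f"{card} ({vram} GB): {size}B @ {precision} "
                  f"needs {need:.1f} GB -> {verdict}")
```

Training needs far more than this: gradients and optimizer state can multiply the footprint several times over, which is why LoRA and QLoRA are the usual route on 24 GB cards.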
Where the 4090 still wins
- Image generation (Stable Diffusion XL, Flux): the models fit, no NVLink is needed, and the 4090 finishes batches almost as fast as an L40S at a quarter of the price.
- Hobbyist fine-tuning: LoRA / QLoRA on 7B–13B models is a 4090's natural job.
- Batch inference of smaller models: serving Qwen 7B or Llama 8B on a 4090 at $0.40/hr is cheaper than the same model on a 1× T4 ($0.50/hr) and runs ~3× faster.
- Single-host fine-tuning: when you don't need to scale past one card, the 4090's per-FLOP advantage is real.
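The batch-inference comparison in the list above comes down to price multiplied by time. Here is a minimal sketch, assuming the article's own figures ($0.40/hr for the 4090, $0.50/hr for the T4, and a ~3× speed advantage) and a hypothetical job that takes one hour on the T4.

```python
def cost_per_job(price_per_hr: float, relative_speed: float,
                 baseline_hours: float = 1.0) -> float:
    """Dollar cost of a job that takes baseline_hours on the reference card.

    relative_speed is throughput relative to the reference card (1.0 = same).
    """
    if relative_speed <= 0:
        raise ValueError("relative_speed must be positive")
    return price_per_hr * baseline_hours / relative_speed


# Article's figures: the T4 is the reference card, the 4090 runs ~3x faster.
t4_cost = cost_per_job(0.50, relative_speed=1.0)
rtx4090_cost = cost_per_job(0.40, relative_speed=3.0)

print(f"T4:   ${t4_cost:.3f} per job")
print(f"4090: ${rtx4090_cost:.3f} per job")
print(f"T4 costs {t4_cost / rtx4090_cost:.2f}x as much for the same work")
```

The hourly price gap looks small, but once the speed difference is included the cost per unit of work differs by a much larger factor.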
When to pay for H100s
- LLM training above 13B parameters — anything bigger needs the VRAM or the multi-GPU bandwidth.
- Production inference where you need low p99 latency and a real SLA.
- Fast multi-GPU scaling — distributed data-parallel + tensor-parallel runs are 3–10× faster on NVLink than on PCIe.
- FP8 training — the H100's native FP8 path roughly doubles throughput on supported frameworks.
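To see why interconnect bandwidth matters for multi-GPU scaling, here is a bandwidth-only estimate of one gradient all-reduce. This is a sketch under stated assumptions: a ring all-reduce in which each GPU moves 2(N-1)/N of the payload, a nominal ~32 GB/s per direction for PCIe 4.0 x16, and a nominal ~450 GB/s per direction for H100 NVLink. Both bandwidth figures are assumed peak values, not measurements, and latency and compute overlap are ignored.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce.

    Each GPU sends (and receives) 2 * (N - 1) / N times the payload.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_per_s


# Assumed nominal per-direction bandwidths (GB/s); real links deliver less.
PCIE4_X16_GB_S = 32.0
NVLINK_H100_GB_S = 450.0

# Hypothetical workload: FP16 gradients of a 7B-parameter model (7e9 * 2 bytes).
grad_gb = 14.0
gpus = 8

pcie_s = ring_allreduce_seconds(grad_gb, gpus, PCIE4_X16_GB_S)
nvlink_s = ring_allreduce_seconds(grad_gb, gpus, NVLINK_H100_GB_S)

print(f"PCIe:   {pcie_s:.3f} s per all-reduce")
print(f"NVLink: {nvlink_s:.3f} s per all-reduce")
print(f"communication-only ratio: {pcie_s / nvlink_s:.1f}x")
```

The ratio here covers communication only. End-to-end speedups are smaller because frameworks overlap communication with compute, which is consistent with the more modest range quoted in the list above.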
Live provider comparison
RTX 4090 providers (top 4 by price, refreshed hourly):
| Provider | $/hr | Offers |
|---|---|---|
| Vast.ai | $0.12/hr | 1 |
| Novita AI | $0.18/hr | 1 |
| io.net | $0.24/hr | 629 |
| TensorDock | $0.27/hr | 16 |
H100 providers:
| Provider | $/hr | Offers |
|---|---|---|
| io.net | $0.80/hr | 0 |
| Fluence | $1.05/hr | 3 |
| Vast.ai | $1.25/hr | 1 |
| Thunder Compute | $1.47/hr | 1 |
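A listed price only helps if there is something to rent at it. The sketch below copies the rows from the two tables above into a hypothetical `Listing` structure, skips listings with zero offers, and recomputes cost per TFLOP-hour using the dense TFLOPS figures quoted earlier in this article. The names are illustrative and the data is a snapshot, not a live feed.

```python
from typing import List, NamedTuple, Optional


class Listing(NamedTuple):
    provider: str
    price_per_hr: float
    offers: int


# Snapshot of the tables above.
RTX_4090_LISTINGS = [
    Listing("Vast.ai", 0.12, 1),
    Listing("Novita AI", 0.18, 1),
    Listing("io.net", 0.24, 629),
    Listing("TensorDock", 0.27, 16),
]
H100_LISTINGS = [
    Listing("io.net", 0.80, 0),
    Listing("Fluence", 1.05, 3),
    Listing("Vast.ai", 1.25, 1),
    Listing("Thunder Compute", 1.47, 1),
]


def cheapest_available(listings: List[Listing]) -> Optional[Listing]:
    """Cheapest listing with at least one offer, or None if nothing is rentable."""
    available = [item for item in listings if item.offers > 0]
    return min(available, key=lambda item: item.price_per_hr) if available else None


# Dense FP16 TFLOPS as quoted in this article (assumed inputs).
for card, listings, tflops in (
    ("RTX 4090", RTX_4090_LISTINGS, 83.0),
    ("H100", H100_LISTINGS, 495.0),
):
    best = cheapest_available(listings)
    if best is None:
        print(f"{card}: no rentable listings")
        continue
    print(f"{card}: {best.provider} at ${best.price_per_hr:.2f}/hr "
          f"-> ${best.price_per_hr / tflops:.5f} per TFLOP-hour")
```

With this snapshot, filtering out the zero-offer row moves the H100 floor from $0.80/hr to $1.05/hr, which widens the 4090's per-FLOP lead.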