Comparison · Updated May 2026

RTX 4090 vs H100: Cost per FLOP for AI training.

We combine live $/hr rental data with published TFLOPS specs to compute the real cost per teraflop on each card, with caveats for 8× multi-GPU scaling and quantization.

RTX 4090

450 W TDP · 24 GB VRAM · 83 TFLOPS FP16

H100

700 W TDP · 80 GB VRAM · 989 TFLOPS FP16 (sparse)

The headline number lies

If you compare raw FP16 throughput, the H100 blows the RTX 4090 out of the water — 989 TFLOPS sparse / 495 TFLOPS dense vs the 4090's 83 TFLOPS dense. So why does anyone train on 4090s?

Per-dollar, the picture flips. The 4090 is a consumer card sold for $1,599 MSRP, available on consumer-friendly P2P marketplaces like Vast.ai and Clore.ai for $0.30–$0.50/hr. The H100 is a $30K datacenter card that rents for $2–$4/hr even at the cheapest providers.

Doing the actual math

Cost per teraflop-hour, FP16 dense, at the rental ranges quoted above:

RTX 4090: $0.30–$0.50/hr ÷ 83 TFLOPS ≈ $0.0036–$0.0060 per TFLOP-hour
H100 (FP16 dense, no sparsity): $2–$4/hr ÷ 495 TFLOPS ≈ $0.0040–$0.0081 per TFLOP-hour

So in flat per-FLOP terms, the 4090 is roughly competitive when you find a $0.30/hr listing — and worse when you don't. But raw FLOPS isn't what you're paying for. You're paying for the ability to actually finish a training run.
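
The per-FLOP arithmetic is easy to rerun against whatever listing you are looking at. Here is a minimal Python sketch, assuming the FP16 dense figures quoted in this article and sample prices taken from the rental ranges mentioned earlier. The function name and the price points are illustrative, not live data.

```python
def cost_per_tflop_hour(price_per_hour: float, dense_tflops: float) -> float:
    """Dollars paid per dense TFLOP for one hour of rental."""
    if dense_tflops <= 0:
        raise ValueError("dense_tflops must be positive")
    return price_per_hour / dense_tflops


# Spec figures as quoted in this article (FP16 dense).
RTX_4090_TFLOPS = 83
H100_TFLOPS = 495

# Sample prices from the ranges quoted in the article; swap in a live listing.
for label, price, tflops in [
    ("RTX 4090 @ $0.30/hr", 0.30, RTX_4090_TFLOPS),
    ("RTX 4090 @ $0.50/hr", 0.50, RTX_4090_TFLOPS),
    ("H100 @ $2.00/hr", 2.00, H100_TFLOPS),
    ("H100 @ $4.00/hr", 4.00, H100_TFLOPS),
]:
    print(f"{label}: ${cost_per_tflop_hour(price, tflops):.4f} per TFLOP-hour")
```

At these sample prices a $0.30/hr 4090 lands slightly below a $2/hr H100, while a $0.50/hr 4090 lands above it, which is why the listing price matters more than the spec sheet.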

What the 4090 can't do

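VRAM ceiling: 24 GB on the 4090 caps you at roughly 10B parameters at FP16 (2 bytes per parameter), or about 40B at INT4 quantization, once you leave a few GB for activations and KV cache. The H100's 80 GB fits a ~30B model at FP16 with room to spare, or a 70B model quantized to INT8 or lower.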
No NVLink: the 4090 lacks the NVLink connector that lets 8× datacenter cards behave like one big GPU. Multi-4090 setups talk over PCIe, which throttles all-reduce in distributed training to fractions of NVLink bandwidth.
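Sparsity: the H100's "989 TFLOPS" number assumes 2:4 structured sparsity, which ordinary dense training does not use. Sparse figures quoted for the 4090's Ada Tensor Cores carry the same caveat, so compare dense numbers on both sides.
FP8: the H100 natively supports FP8 (1979 TFLOPS sparse). The 4090's Ada Tensor Cores also accept FP8, but at lower throughput and without the VRAM or interconnect to feed large FP8 training runs.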
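
A quick way to sanity-check whether a model fits on a card is weights-only arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate under stated assumptions. The 4 GB headroom for activations and KV cache is a hypothetical placeholder, and training needs considerably more memory than this because of gradients and optimizer state.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed headroom fit in VRAM."""
    return weights_gb(params_billion, precision) + headroom_gb <= vram_gb


for size in (7, 13, 34, 70):
    for precision in ("fp16", "int4"):
        print(f"{size}B {precision}: {weights_gb(size, precision):.1f} GB weights, "
              f"fits 24 GB: {fits(size, precision, 24)}, "
              f"fits 80 GB: {fits(size, precision, 80)}")
```

Adjust `headroom_gb` for your batch size and context length; long contexts can push the KV cache well past a few GB.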

Where the 4090 still wins

Image generation (Stable Diffusion XL, Flux): models fit, no NVLink needed, 4090 finishes batches almost as fast as an L40S at a quarter the price.
Hobbyist fine-tuning: LoRA / QLoRA on 7B–13B models is a 4090's natural job.
Batch inference of smaller models: serving Qwen 7B or Llama 8B on a 4090 at $0.40/hr is cheaper than the same model on a 1× T4 ($0.50/hr) and runs ~3× faster.
Single-host fine-tuning: when you don't need to scale past one card, the 4090's per-FLOP advantage is real.
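
For inference and single-card work, the useful number is cost per finished job rather than cost per hour. A small sketch using the T4 comparison from the list above; the 10-hour baseline job is a hypothetical workload, and the ~3× speedup is the article's own estimate rather than a measurement.

```python
def cost_per_job(price_per_hour: float, hours_on_baseline: float,
                 speedup_vs_baseline: float = 1.0) -> float:
    """Rental cost to finish a job that takes `hours_on_baseline` on the baseline card."""
    return price_per_hour * hours_on_baseline / speedup_vs_baseline


# Hypothetical job: 10 hours on a T4 at $0.50/hr.
t4 = cost_per_job(0.50, hours_on_baseline=10)
# Same job on a 4090 at $0.40/hr, assuming it runs ~3x faster.
rtx4090 = cost_per_job(0.40, hours_on_baseline=10, speedup_vs_baseline=3.0)

print(f"T4: ${t4:.2f}, RTX 4090: ${rtx4090:.2f}, ratio: {t4 / rtx4090:.2f}x")
```

The lower hourly price and the speedup multiply, so a modest price gap turns into a much larger gap per job.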

When to pay for H100s

LLM training above 13B parameters — anything bigger needs the VRAM or the multi-GPU bandwidth.
Production inference where you need low p99 latency and a real SLA.
Fast multi-GPU scaling — distributed data-parallel + tensor-parallel runs are 3–10× faster on NVLink than on PCIe.
FP8 training — the H100's native FP8 path roughly doubles throughput on supported frameworks.
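
To see why interconnect bandwidth dominates multi-GPU scaling, it helps to estimate the communication term on its own. The sketch below uses the standard bandwidth-only model of a ring all-reduce, where each GPU transfers 2 × (N − 1) / N of the payload. The link bandwidths in the example are placeholder values chosen for illustration, not measurements of any specific PCIe or NVLink setup, and the model ignores latency and overlap with compute.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N of the payload.
    Latency and overlap with compute are ignored.
    """
    if link_gb_per_s <= 0:
        raise ValueError("link_gb_per_s must be positive")
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    return 2 * (num_gpus - 1) / num_gpus * payload_gb / link_gb_per_s


# Example: FP16 gradients of a 7B model (7e9 params * 2 bytes = 14 GB), 8 GPUs.
GRADIENTS_GB = 14.0
slow_link = ring_allreduce_seconds(GRADIENTS_GB, 8, link_gb_per_s=25.0)   # placeholder PCIe-class link
fast_link = ring_allreduce_seconds(GRADIENTS_GB, 8, link_gb_per_s=300.0)  # placeholder NVLink-class link

print(f"slow link: {slow_link:.2f} s per all-reduce")
print(f"fast link: {fast_link:.3f} s per all-reduce")
```

This covers only the gradient exchange. End-to-end speedups are smaller than the raw bandwidth ratio because compute time per step does not change and part of the communication overlaps with it.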

FAQ

Is the RTX 4090 really cheaper per FLOP than the H100?

It depends on the price you find. At FP16, the RTX 4090 delivers ~83 TFLOPS dense vs the H100's ~495 TFLOPS dense (989 TFLOPS sparse), so the H100 wins on absolute throughput. Per-dollar, the picture can flip on consumer marketplaces (Vast.ai, Clore.ai), where 4090s rent for $0.30–$0.50/hr vs H100s at $2–$4/hr. The math above covers both cards.

What's the catch with renting an RTX 4090 instead of an H100?

Two main ones: (1) VRAM — 24 GB on the 4090 vs 80 GB on the H100, so anything larger than a 13B model at FP16 either needs quantization or multiple 4090s; (2) no NVLink — multi-4090 setups bottleneck on PCIe bandwidth, so 8× 4090s rarely scale as well as 8× H100 for distributed training.

Which providers carry both cards?

Vast.ai and Clore have both RTX 4090s and H100s. Lambda Labs, RunPod, and the hyperscalers carry H100s but typically not consumer 4090s — they sell the workstation-class L40S or A6000 instead.

When should I just pay for H100s?

For LLM training >13B parameters, for production inference where consistent low latency matters, or anywhere you need fast multi-GPU scaling. The 4090 wins on cost-per-FLOP for hobbyist fine-tuning, batch inference of smaller models, and image generation (SDXL, Flux).