API aggregators US

Replicate.

Per-second billing serverless model API. Strong for image / video / audio models alongside LLMs.

At a glance

Service type: API aggregators
Trust tier: Tier 1
Headquarters: US
OpenAI-compat: No
Open weights: Yes
Proprietary: No

When to pick Replicate

Best for

Building once and swapping models freely — same key, same endpoint shape.
Workloads that benefit from automatic failover across upstreams.
Anyone who wants per-token billing without managing N separate accounts.

Avoid for

Workloads needing the absolute lowest per-token price (first-party usually wins).
Anything requiring real-time price quotes from the original maker.

Models on Replicate

Pricing + measured speed + self-host alternative, one row per model. Click a column header to sort.

8 models · 0 benchmarked

Model ↕	Maker ↕	Access ↕	$/M in ↕	$/M out ↕	Tokens/sec ↕	TTFT ↕	Self-host on ↕
Stable Diffusion XL	Stability AI	hosted inference	—	—	—	—	1× Nvidia Titan V · INT4	Open →
Stable Diffusion 3.5 Medium	Stability AI	hosted inference	—	—	—	—	1× Nvidia GeForce GTX 1050 · INT4	Open →
Stable Diffusion 1.5	Stability AI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP16	Open →
FLUX.1 Dev	Black Forest Labs	hosted inference	—	—	—	—	1× AMD Radeon RX 5700 XT · INT4	Open →
Stable Diffusion 3.5 Large	Stability AI	hosted inference	—	—	—	—	1× Nvidia GeForce RTX 2060 · INT4	Open →
FLUX.1 Pro	Black Forest Labs	hosted inference	—	—	—	—	1× AMD Radeon RX 5700 XT · INT4	Open →
Whisper Large v3	OpenAI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP8	Open →
Llama 3.3 70B	Meta AI	hosted inference	—	—	—	—	1× Nvidia L40S · INT4	Open →

Peers in the same bucket

Sanity-check before you commit.

FAQ

Frequently asked.

How does Replicate bill for inference?

Most inference providers bill per million input + output tokens. Some (like Replicate, fal.ai) use per-second billing where you pay for actual compute time on the underlying GPU. Per-token rates are listed on each model's access row.

Does Replicate offer an OpenAI-compatible API?

Not by default. Use Replicate's native SDK — most clients support a thin OpenAI compatibility wrapper if you need it.

Which models does Replicate host?

See the model list on this page. Each row links to a per-model show page where you can compare across every provider that carries the same SKU.

What's the difference between Replicate and a GPU rental host?

Replicate hosts inference for you — you pay per token, the provider runs the GPU. GPU rental (Vast.ai, Lambda, RunPod) gives you the raw GPU at $/hr and you run the model yourself. Both have catalog pages on RentGPU; pick based on whether you want managed inference or full control.