API aggregators OpenAI-compatible US

Fireworks AI.

Fast inference on popular open-weight models. Speculative decoding + custom kernels keep latency low. Per-token billing.

Cheapest 12 models

Where the floor is.

Sorted cheapest-first by $/M input. Useful when you're looking for the floor before picking a model.

At a glance

Service type: API aggregators
Trust tier: Tier 1
Headquarters: US
Founded: 2022
OpenAI-compat: Yes
Open weights: Yes
Proprietary: No

When to pick Fireworks AI

Best for

Building once and swapping models freely — same key, same endpoint shape.
Workloads that benefit from automatic failover across upstreams.
Anyone who wants per-token billing without managing N separate accounts.

Avoid for

Workloads needing the absolute lowest per-token price (first-party usually wins).
Anything requiring real-time price quotes from the original maker.

Models on Fireworks AI

Pricing + measured speed + self-host alternative, one row per model. Click a column header to sort.

20 models · 0 benchmarked

Model ↕	Maker ↕	Access ↕	$/M in ↕	$/M out ↕	Tokens/sec ↕	TTFT ↕	Self-host on ↕
Llama 3.1 8B	Meta AI	hosted inference	$0.2	$0.2	—	—	1× Nvidia P102-100 · INT4	Open →
Yi-34B	01.AI	hosted inference	—	—	—	—	1× Nvidia GeForce RTX 3090 · INT4	Open →
Yi-34B	01.AI	hosted inference	—	—	—	—	1× Nvidia GeForce RTX 3090 · INT4	Open →
Llama 3.3 70B	Meta AI	hosted inference	—	—	—	—	1× Nvidia L40S · INT4	Open →
DeepSeek V3	DeepSeek	hosted inference	—	—	—	—	2× AMD MI325 · INT4	Open →
Llama 3.3 70B	Meta AI	hosted inference	—	—	—	—	1× Nvidia L40S · INT4	Open →
DeepSeek V3	DeepSeek	hosted inference	—	—	—	—	2× AMD MI325 · INT4	Open →
Kimi K2.6	Moonshot AI	hosted inference	—	—	—	—	API only	Open →
MiniMax-M2.5	MiniMax	hosted inference	—	—	—	—	API only	Open →
MiniMax M2.7	MiniMax	hosted inference	—	—	—	—	1× AMD MI355X · INT4	Open →
Kimi 2.7 Code	Moonshot AI	hosted inference	—	—	—	—	API only	Open →
MiniMax: MiniMax M3	MiniMax	hosted inference	—	—	—	—	API only	Open →
NVIDIA Nemotron 3 Ultra NVFP4	Nvidia	hosted inference	—	—	—	—	API only	Open →
GLM 5.2	zai-org	hosted inference	—	—	—	—	API only	Open →
DeepSeek: DeepSeek V4 Flash	DeepSeek	hosted inference	—	—	—	—	API only	Open →
DeepSeek: DeepSeek V4 Pro	DeepSeek	hosted inference	—	—	—	—	API only	Open →
GLM 5.1	zai-org	hosted inference	—	—	—	—	API only	Open →
GPT-OSS 120B	OpenAI	hosted inference	—	—	—	—	1× Nvidia A100 · INT4	Open →
GPT-OSS 20B	OpenAI	hosted inference	—	—	—	—	1× Nvidia RTX 4060 Ti · INT4	Open →
Kimi K2.5	Moonshot AI	hosted inference	—	—	—	—	API only	Open →

Peers in the same bucket

Sanity-check before you commit.

FAQ

Frequently asked.

How does Fireworks AI bill for inference?

Most inference providers bill per million input + output tokens. Some (like Replicate, fal.ai) use per-second billing where you pay for actual compute time on the underlying GPU. Per-token rates are listed on each model's access row.

Does Fireworks AI offer an OpenAI-compatible API?

Yes. Fireworks AI exposes an OpenAI-compatible endpoint, so most OpenAI client libraries work after swapping the base URL.

Which models does Fireworks AI host?

See the model list on this page. Each row links to a per-model show page where you can compare across every provider that carries the same SKU.

What's the difference between Fireworks AI and a GPU rental host?

Fireworks AI hosts inference for you — you pay per token, the provider runs the GPU. GPU rental (Vast.ai, Lambda, RunPod) gives you the raw GPU at $/hr and you run the model yourself. Both have catalog pages on RentGPU; pick based on whether you want managed inference or full control.