Serverless specialty OpenAI-compatible US

Groq.

Ultra-low-latency inference on custom LPU silicon. Open-weight LLMs at >500 tokens/sec; OpenAI-compatible API.

Cheapest 12 models

Where the floor is.

Sorted cheapest-first by $/M input. Useful when you're looking for the floor before picking a model.

At a glance

Service type: Serverless specialty
Trust tier: Tier 1
Headquarters: US
OpenAI-compat: Yes
Open weights: Yes
Proprietary: No

When to pick Groq

Best for

Ultra-low-latency inference (Groq's LPU silicon, Cerebras).
Image / video / audio generation via per-second billing.
Workloads where the specialty's hardware advantage outweighs cost.

Avoid for

General LLM workloads where a generalist aggregator is cheaper.
Workloads needing feature parity across many models.

Models on Groq

Pricing + measured speed + self-host alternative, one row per model. Click a column header to sort.

5 models · 0 benchmarked

Model ↕	Maker ↕	Access ↕	$/M in ↕	$/M out ↕	Tokens/sec ↕	TTFT ↕	Self-host on ↕
Whisper Large v3	OpenAI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP8	Open →
Llama 3.1 70B	Meta AI	hosted inference	$0.59	$0.79	—	—	1× Nvidia L40S · INT4	Open →
DeepSeek R1 Distill Llama 70B	DeepSeek	hosted inference	$0.75	$0.99	—	—	1× Nvidia L40S · INT4	Open →
Llama 3.3 70B	Meta AI	hosted inference	—	—	—	—	1× Nvidia L40S · INT4	Open →
Llama 3.1 8B	Meta AI	hosted inference	$0.05	$0.08	—	—	1× Nvidia P102-100 · INT4	Open →

Peers in the same bucket

Sanity-check before you commit.

fal.ai 4

Black Forest Labs API 1

Stability AI Platform 1

Ollama 44

FAQ

Frequently asked.

How does Groq bill for inference?

Most inference providers bill per million input + output tokens. Some (like Replicate, fal.ai) use per-second billing where you pay for actual compute time on the underlying GPU. Per-token rates are listed on each model's access row.

Does Groq offer an OpenAI-compatible API?

Yes. Groq exposes an OpenAI-compatible endpoint, so most OpenAI client libraries work after swapping the base URL.

Which models does Groq host?

See the model list on this page. Each row links to a per-model show page where you can compare across every provider that carries the same SKU.

What's the difference between Groq and a GPU rental host?

Groq hosts inference for you — you pay per token, the provider runs the GPU. GPU rental (Vast.ai, Lambda, RunPod) gives you the raw GPU at $/hr and you run the model yourself. Both have catalog pages on RentGPU; pick based on whether you want managed inference or full control.