Serverless specialty US

Ollama.

Local-first runtime for open-weight LLMs — run them on your own machine or rented GPU. Library indexes Llama, Kimi, GLM, Qwen, …

At a glance

Service type: Serverless specialty
Trust tier: Tier 2
Headquarters: US
OpenAI-compat: No
Open weights: Yes
Proprietary: No

When to pick Ollama

Best for

Ultra-low-latency inference (Groq's LPU silicon, Cerebras).
Image / video / audio generation via per-second billing.
Workloads where the specialty's hardware advantage outweighs cost.

Avoid for

General LLM workloads where a generalist aggregator is cheaper.
Workloads needing feature parity across many models.

Models on Ollama

Pricing + measured speed + self-host alternative, one row per model. Click a column header to sort.

44 models · 0 benchmarked

Model ↕	Maker ↕	Access ↕	$/M in ↕	$/M out ↕	Tokens/sec ↕	TTFT ↕	Self-host on ↕
Kimi K2 Thinking	Moonshot AI	self hosted	—	—	—	—	4× AMD MI300 · INT4	Open →
Kimi K2.5	Moonshot AI	self hosted	—	—	—	—	4× AMD MI300 · INT4	Open →
Kimi K2.6	Moonshot AI	self hosted	—	—	—	—	4× AMD MI300 · INT4	Open →
Llama 3.1 405B	Meta AI	self hosted	—	—	—	—	1× AMD MI325 · INT4	Open →
Qwen 2.5 72B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× NVIDIA A40 · INT4	Open →
Qwen 2.5 Coder 32B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia RTX A5000 · INT4	Open →
Llama 3.1 8B	Meta AI	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
Llama 3.1 70B	Meta AI	self hosted	—	—	—	—	1× Nvidia L40S · INT4	Open →
Llama 3.2 1B	Meta AI	self hosted	—	—	—	—	1× Nvidia Titan V · FP8	Open →
Llama 3.2 3B	Meta AI	self hosted	—	—	—	—	1× Nvidia Titan V · INT4	Open →
Qwen 2.5 7B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
Qwen 2.5 14B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia CMP 50HX · INT4	Open →
Qwen 2.5 32B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia RTX A5000 · INT4	Open →
Qwen 2.5 3B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia GeForce GTX 1050 · INT4	Open →
Qwen 3 235B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× AMD MI300 · INT4	Open →
Qwen 3 32B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia RTX A5000 · INT4	Open →
Qwen 3 14B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia CMP 50HX · INT4	Open →
Qwen 3 8B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia GeForce RTX 2060 · INT4	Open →
Qwen 3 4B	Alibaba (Qwen Team)	self hosted	—	—	—	—	1× Nvidia Titan V · INT4	Open →
Gemma 2 27B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia RTX 4000 Ada · INT4	Open →
Gemma 2 9B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia GeForce RTX 2060 · INT4	Open →
Gemma 2 2B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia GeForce GTX 1050 · INT4	Open →
Gemma 3 12B	Google DeepMind	self hosted	—	—	—	—	1× AMD Radeon RX 5700 XT · INT4	Open →
Gemma 3 4B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia Titan V · INT4	Open →
Gemma 3 1B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia Titan V · FP16	Open →
LLaVA 34B	LLaVA Project	self hosted	—	—	—	—	1× Nvidia RTX A5000 · INT4	Open →
LLaVA 13B	LLaVA Project	self hosted	—	—	—	—	1× Nvidia GeForce RTX 3080 · INT4	Open →
LLaVA 7B	LLaVA Project	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
Code Llama 70B	Meta AI	self hosted	—	—	—	—	1× Nvidia L40S · INT4	Open →
Code Llama 34B	Meta AI	self hosted	—	—	—	—	1× Nvidia GeForce RTX 3090 · INT4	Open →
Code Llama 13B	Meta AI	self hosted	—	—	—	—	1× Nvidia CMP 50HX · INT4	Open →
Code Llama 7B	Meta AI	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
DeepSeek Coder V2 236B	DeepSeek	self hosted	—	—	—	—	1× AMD MI300 · INT4	Open →
DeepSeek Coder V2 Lite	DeepSeek	self hosted	—	—	—	—	1× Nvidia Titan V · INT4	Open →
DeepSeek Coder 33B	DeepSeek	self hosted	—	—	—	—	1× Nvidia GeForce RTX 3090 · INT4	Open →
Mistral Nemo 12B	Mistral AI	self hosted	—	—	—	—	1× AMD Radeon RX 5700 XT · INT4	Open →
GPT-OSS 120B	OpenAI	self hosted	—	—	—	—	1× Nvidia A100 · INT4	Open →
GPT-OSS 20B	OpenAI	self hosted	—	—	—	—	1× Nvidia RTX 4060 Ti · INT4	Open →
IBM Granite Code 8B	IBM Research	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
Hermes 3 70B	Nous Research	self hosted	—	—	—	—	1× Nvidia L40S · INT4	Open →
Hermes 3 8B	Nous Research	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
OLMo 3 7B	Allen Institute for AI (AI2)	self hosted	—	—	—	—	1× Nvidia P102-100 · INT4	Open →
Gemma 3 27B	Google DeepMind	self hosted	—	—	—	—	1× Nvidia RTX 4000 Ada · INT4	Open →
Llama 3.3 70B	Meta AI	self hosted	—	—	—	—	1× Nvidia L40S · INT4	Open →

Peers in the same bucket

Sanity-check before you commit.

Groq 5

fal.ai 4

Black Forest Labs API 1

Stability AI Platform 1

FAQ

Frequently asked.

How does Ollama bill for inference?

Most inference providers bill per million input + output tokens. Some (like Replicate, fal.ai) use per-second billing where you pay for actual compute time on the underlying GPU. Per-token rates are listed on each model's access row.

Does Ollama offer an OpenAI-compatible API?

Not by default. Use Ollama's native SDK — most clients support a thin OpenAI compatibility wrapper if you need it.

Which models does Ollama host?

See the model list on this page. Each row links to a per-model show page where you can compare across every provider that carries the same SKU.

What's the difference between Ollama and a GPU rental host?

Ollama hosts inference for you — you pay per token, the provider runs the GPU. GPU rental (Vast.ai, Lambda, RunPod) gives you the raw GPU at $/hr and you run the model yourself. Both have catalog pages on RentGPU; pick based on whether you want managed inference or full control.