Hugging Face Inference Endpoints
Hugging Face's managed inference service for any model on the Hub. Endpoints auto-scale and run on AWS, Azure, or GCP infrastructure.
At a glance
- Service type: Hyperscaler gateways
- Trust tier: Tier 1
- Headquarters: US
- OpenAI-compat: No
- Open weights: Yes
- Proprietary: No
When to pick Hugging Face Inference Endpoints
Best for
- Existing AWS / GCP / Azure customers — same IAM, same VPC, same billing.
- Regulated workloads requiring the hyperscaler's compliance frameworks.
- Multi-region production deployments tightly coupled to other cloud services.
Avoid for
- Cost-sensitive workloads: hyperscaler gateways charge a markup over first-party pricing.
- Anyone who doesn't already need the surrounding cloud platform.
Models on Hugging Face Inference Endpoints
Pricing, measured speed, and a self-host alternative, one row per model.
| Model | Maker | Access | Input ($/M tokens) | Output ($/M tokens) | Tokens/sec | TTFT | Self-host on |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 Distill Qwen 7B | DeepSeek | hosted inference | — | — | — | — | 1× Nvidia P102-100 · INT4 |
| DeepSeek R1 Distill Qwen 1.5B | DeepSeek | hosted inference | — | — | — | — | 1× Nvidia Titan V · FP8 |
| Whisper Medium | OpenAI | hosted inference | — | — | — | — | 1× Nvidia Titan V · FP16 |
| Whisper Small | OpenAI | hosted inference | — | — | — | — | 1× Nvidia Titan V · FP16 |
| Llama 3.2 1B | Meta AI | hosted inference | — | — | — | — | 1× Nvidia Titan V · FP8 |
| Llama 3.2 3B | Meta AI | hosted inference | — | — | — | — | 1× Nvidia Titan V · INT4 |
| DeepSeek R1 Distill Qwen 14B | DeepSeek | hosted inference | — | — | — | — | 1× Nvidia Titan V · INT4 |
| Gemma 3 27B | Google DeepMind | hosted inference | — | — | — | — | 1× Nvidia RTX 4000 Ada · INT4 |
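
The "Self-host on" column pairs each model with a GPU and a numeric precision. As a rough sanity check on such pairings, the weights-only memory footprint can be estimated as parameter count times bytes per weight. The sketch below is an illustration, not the method used to build the table: the parameter counts are nominal values read off the model names (an assumption), the helper `weight_memory_gb` is hypothetical, and the estimate ignores KV cache, activations, and runtime overhead, so real requirements will be higher.

```python
# Weights-only memory estimate: params x (bits per weight / 8).
# Assumption: nominal parameter counts taken from the model names,
# not from published model cards.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[precision] / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


# (model, nominal parameters in billions, precision from the table)
ROWS = [
    ("DeepSeek R1 Distill Qwen 1.5B", 1.5, "FP8"),
    ("DeepSeek R1 Distill Qwen 7B", 7.0, "INT4"),
    ("DeepSeek R1 Distill Qwen 14B", 14.0, "INT4"),
    ("Llama 3.2 1B", 1.0, "FP8"),
    ("Llama 3.2 3B", 3.0, "INT4"),
    ("Gemma 3 27B", 27.0, "INT4"),
]

for name, params_b, precision in ROWS:
    size = weight_memory_gb(params_b, precision)
    print(f"{name}: ~{size:.1f} GB of weights at {precision}")
```

Compare the result against the GPU's memory with headroom to spare: a card whose memory barely exceeds the weights-only figure will likely run out once context length grows.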