Hugging Face Inference Endpoints

Hugging Face's managed inference service for any model on the Hub. Endpoints auto-scale and run on AWS, Azure, or GCP infrastructure.

At a glance

Service type: Hyperscaler gateways
Trust tier: Tier 1
Headquarters: US
OpenAI-compatible: No
Open weights: Yes
Proprietary: No

When to pick Hugging Face Inference Endpoints

Best for

  • Existing AWS, GCP, or Azure customers who want to keep the same IAM, VPC, and billing.
  • Regulated workloads requiring the hyperscaler's compliance frameworks.
  • Multi-region production deployments tightly coupled to other cloud services.

Avoid for

  • Cost-sensitive workloads: hyperscaler-hosted inference typically carries a markup over first-party pricing.
  • Anyone who doesn't already need the surrounding cloud platform.

Models on Hugging Face Inference Endpoints

Pricing, measured speed, and self-host alternative, one row per model.

8 models · 0 benchmarked
Model                          Maker            Access            $/M in  $/M out  Tokens/sec  TTFT  Self-host on
DeepSeek R1 Distill Qwen 7B    DeepSeek         hosted inference  -       -        -           -     1× Nvidia P102-100 · INT4
DeepSeek R1 Distill Qwen 1.5B  DeepSeek         hosted inference  -       -        -           -     1× Nvidia Titan V · FP8
Whisper Medium                 OpenAI           hosted inference  -       -        -           -     1× Nvidia Titan V · FP16
Whisper Small                  OpenAI           hosted inference  -       -        -           -     1× Nvidia Titan V · FP16
Llama 3.2 1B                   Meta AI          hosted inference  -       -        -           -     1× Nvidia Titan V · FP8
Llama 3.2 3B                   Meta AI          hosted inference  -       -        -           -     1× Nvidia Titan V · INT4
DeepSeek R1 Distill Qwen 14B   DeepSeek         hosted inference  -       -        -           -     1× Nvidia Titan V · INT4
Gemma 3 27B                    Google DeepMind  hosted inference  -       -        -           -     1× Nvidia RTX 4000 Ada · INT4
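
The "Self-host on" column pairs each model with a quantization level (INT4, FP8, FP16). As a rough illustration of why that pairing matters, the weight footprint can be approximated as parameter count times bytes per weight. The sketch below is an assumption-laden estimate, not this listing's own method: the function name is hypothetical, the parameter counts are the nominal sizes read from the model names, and the result covers weights only (no KV cache, activations, or runtime overhead).

```python
# Bytes per weight for the quantization labels used in the table.
BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_footprint_gb(params_billions: float, quant: str) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes).

    params_billions is the nominal size from the model name (assumption);
    real checkpoints may differ slightly.
    """
    return params_billions * BYTES_PER_WEIGHT[quant]


if __name__ == "__main__":
    # (model, nominal parameters in billions, quantization from the table)
    rows = [
        ("DeepSeek R1 Distill Qwen 7B", 7.0, "INT4"),
        ("DeepSeek R1 Distill Qwen 1.5B", 1.5, "FP8"),
        ("DeepSeek R1 Distill Qwen 14B", 14.0, "INT4"),
        ("Gemma 3 27B", 27.0, "INT4"),
    ]
    for name, params, quant in rows:
        gb = weight_footprint_gb(params, quant)
        print(f"{name} ({quant}): ~{gb:.1f} GB of weights")
```

Compare the estimate against the listed GPU's memory with some headroom left over, since inference needs more than the weights alone.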