Hugging Face Inference Endpoints.

Hugging Face's managed inference for any model on the Hub. Auto-scales; backed by AWS/Azure/GCP.

At a glance

Service type: Hyperscaler gateways
Trust tier: Tier 1
Headquarters: US
OpenAI-compat: No
Open weights: Yes
Proprietary: No

When to pick Hugging Face Inference Endpoints

Best for

Existing AWS / GCP / Azure customers — same IAM, same VPC, same billing.
Regulated workloads requiring the hyperscaler's compliance frameworks.
Multi-region production deployments tightly coupled to other cloud services.

Avoid for

Cost-sensitive workloads — hyperscaler markup over first-party is real.
Anyone who doesn't already need the surrounding cloud platform.

Models on Hugging Face Inference Endpoints

Pricing + measured speed + self-host alternative, one row per model.

8 models · 0 benchmarked

Model	Maker	Access	$/M in	$/M out	Tokens/sec	TTFT	Self-host on
DeepSeek R1 Distill Qwen 7B	DeepSeek	hosted inference	—	—	—	—	1× Nvidia P102-100 · INT4
DeepSeek R1 Distill Qwen 1.5B	DeepSeek	hosted inference	—	—	—	—	1× Nvidia Titan V · FP8
Whisper Medium	OpenAI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP16
Whisper Small	OpenAI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP16
Llama 3.2 1B	Meta AI	hosted inference	—	—	—	—	1× Nvidia Titan V · FP8
Llama 3.2 3B	Meta AI	hosted inference	—	—	—	—	1× Nvidia Titan V · INT4
DeepSeek R1 Distill Qwen 14B	DeepSeek	hosted inference	—	—	—	—	1× Nvidia Titan V · INT4
Gemma 3 27B	Google DeepMind	hosted inference	—	—	—	—	1× Nvidia RTX 4000 Ada · INT4
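
The self-host column pairs each model with a GPU and a quantization level. A rough way to sanity-check such a pairing is to estimate weight memory as parameter count times bits per weight. The sketch below is an illustration only: the function names, the 80% headroom factor, and the example VRAM figure are assumptions, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Weight-only memory estimate: parameters x bits per weight / 8.
# This is a lower bound; real deployments also need room for the KV cache,
# activations, and framework overhead.

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[precision]
    return params_billion * 1e9 * bits / 8 / 1e9


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction of VRAM.

    The 0.8 headroom default is an assumed safety margin, not a measured value.
    """
    return weight_memory_gb(params_billion, precision) <= vram_gb * headroom


if __name__ == "__main__":
    # A 27B-parameter model at INT4, checked against an assumed 20 GB card.
    print(f"{weight_memory_gb(27, 'INT4'):.1f} GB")  # 13.5 GB
    print(fits(27, "INT4", vram_gb=20.0))            # True
```

Substitute the actual VRAM of your card for `vram_gb`; a pairing that only just passes this check may still run out of memory at long context lengths.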

Peers in the same bucket

Sanity-check before you commit.

AWS Bedrock 4

Google Vertex AI 5

Azure OpenAI Service 4

FAQ

Frequently asked.

How does Hugging Face Inference Endpoints bill for inference?

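Most inference providers bill per million input and output tokens. Some (such as Replicate and fal.ai) bill per second, so you pay for actual compute time on the underlying GPU. Where per-token rates are known, they appear in the $/M in and $/M out columns of the model table; no rates are currently listed for Hugging Face Inference Endpoints on this page, so check the provider's own pricing before you commit.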
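
To make the two billing models concrete, here is a minimal sketch that prices the same workload both ways. The function names are hypothetical and every rate in the example is made up for illustration; none of them is a quoted price from any provider.

```python
def per_token_cost(tokens_in: int, tokens_out: int,
                   usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Cost under per-token billing, with rates given per million tokens."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out


def per_second_cost(compute_seconds: float, usd_per_second: float) -> float:
    """Cost under per-second billing for the GPU time actually used."""
    return compute_seconds * usd_per_second


if __name__ == "__main__":
    # Illustrative numbers only: 2M input tokens, 500k output tokens.
    token_bill = per_token_cost(2_000_000, 500_000,
                                usd_per_m_in=0.50, usd_per_m_out=1.50)
    # Same workload assumed to take one hour of GPU time at $0.0005/s.
    time_bill = per_second_cost(3600, usd_per_second=0.0005)
    print(f"per-token:  ${token_bill:.2f}")
    print(f"per-second: ${time_bill:.2f}")
```

Under per-second billing the result depends on how fast the model runs on the chosen GPU, so the comparison is only meaningful with a measured throughput.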

Does Hugging Face Inference Endpoints offer an OpenAI-compatible API?

Not by default. Use Hugging Face Inference Endpoints' native SDK; some client libraries offer a thin OpenAI-compatibility wrapper if you need one.
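
As a rough illustration of calling an endpoint without an OpenAI-style client, the sketch below builds a plain HTTPS request using only the Python standard library instead of the native SDK. It assumes the endpoint serves a text-generation task that accepts a JSON body with `inputs` and `parameters`; the URL and token are placeholders, `build_request` is a hypothetical helper, and the request is constructed but never sent.

```python
import json
import urllib.request


def build_request(endpoint_url: str, token: str, prompt: str,
                  max_new_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text-generation endpoint.

    The body shape {"inputs": ..., "parameters": {...}} is an assumption
    about the deployed task; other tasks expect different payloads.
    """
    payload = {"inputs": prompt,
               "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL and token; replace with your endpoint's values.
    req = build_request("https://example.invalid/my-endpoint",
                        "hf_xxx", "Hello")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req) and json.load the response.
```

Keeping request construction separate from sending makes it easy to inspect the payload first and to swap in a different client later.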

Which models does Hugging Face Inference Endpoints host?

See the model list on this page. Each row links to a per-model page where you can compare every provider that carries the same SKU.

What's the difference between Hugging Face Inference Endpoints and a GPU rental host?

Hugging Face Inference Endpoints hosts inference for you: the provider provisions and operates the GPU behind a managed endpoint. GPU rental (Vast.ai, Lambda, RunPod) gives you the raw GPU at $/hr and you run the model yourself. Both have catalog pages on RentGPU; pick based on whether you want managed inference or full control.
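
One way to compare a $/hr rental against a per-token price is to convert the hourly rate into an effective cost per million tokens. The sketch below does that under stated assumptions: the function name is hypothetical, the throughput must come from your own measurements, and the example figures are invented for illustration.

```python
def rental_usd_per_m_tokens(usd_per_hour: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Effective $/M tokens for a rented GPU.

    utilization is the fraction of each rented hour spent generating
    tokens (an assumption you supply; idle time still costs money).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return usd_per_hour / tokens_per_hour * 1e6


if __name__ == "__main__":
    # Invented example: $0.50/hr GPU, 100 tokens/s, busy half the time.
    cost = rental_usd_per_m_tokens(0.50, 100, utilization=0.5)
    print(f"${cost:.2f} per million tokens")
```

If the result is above the per-token price a managed provider quotes for the same model, managed inference is likely cheaper at that utilization; the comparison leaves out the engineering time needed to run the model yourself.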