Question 1

How is the monthly bill calculated?

Accepted Answer

Total = (input rate × input tokens / 1M) + (output rate × output tokens / 1M). We pull live per-token prices from each provider's official pricing page or /v1/models API and recompute on every page load — no caching beyond a brief edge TTL.

Question 2

Where does the pricing data come from?

Accepted Answer

Direct from each inference provider — Anthropic, OpenAI, OpenRouter, Together AI, Fireworks AI, DeepInfra, z.ai, Groq, and a dozen others. The daily refresh job (RefreshAiModelCatalogJob) re-pulls each provider's /v1/models endpoint and updates our AiModelAccess rows.

Question 3

Should I always pick the cheapest provider?

Accepted Answer

Cheapest by $/M tokens isn't always cheapest by total cost. Watch for: (1) caching discounts that aggregators like OpenRouter don't pass through fully, (2) rate-limit ceilings on the smaller hosts that force you onto a more expensive tier under load, (3) per-request latency overhead from aggregators (extra ~50ms). For low-volume or bursty workloads, the absolute cheapest is usually right. For sustained production traffic, factor in throughput + reliability.

Question 4

What's the difference between OpenRouter and the model maker's direct API?

Accepted Answer

OpenRouter is an aggregator — they route your request to one of several upstream providers and add a small markup (typically 5-20%). The model maker's direct API (e.g. api.anthropic.com for Claude) gives you the bare price + access to native features like Anthropic's prompt caching or Google's context caching. Direct is cheaper at scale; OpenRouter wins when you want one key to access dozens of models.

Question 5

Are input + output prices the same?

Accepted Answer

No — output tokens are typically 3-5× more expensive than input. The 'cheapest input' provider isn't always 'cheapest output'. The 'monthly bill' column accounts for both; sort by that column for the real total.

Question 6

How does prompt caching affect this estimate?

Accepted Answer

Cached input tokens are usually 50-90% cheaper than fresh input. We don't currently model caching because it depends on your workload pattern (long system prompts re-used across requests benefit; chat with fresh context each turn doesn't). For high-volume single-prompt workloads, halve the input cost when comparing Anthropic/Google/OpenAI direct.

Question 7

How often is this pricing refreshed?

Accepted Answer

Daily, via a scheduled background job at 4:15am UTC. Live prices show up within ~24 hours of a provider changing them on their pricing page. For breaking price drops (e.g. DeepSeek's R1 launch) we'll re-run manually.

Question 8

What about fine-tuned variants?

Accepted Answer

Fine-tuned model deployments are priced separately from the base model and often have different rate structures (hourly compute + per-token blended). This tool covers the base-model token pricing only. For fine-tuned costs, check the provider's per-deployment pricing page directly.

Google: Gemini 2.0 Flash.

Cheapest provider on the left.

Bill breakdown.

Three steps to your monthly estimate.

Pick the model.

Estimate volume.

Read the spread.

Frequently asked.

Keep going.

Self-host vs API breakeven.

VRAM fit calculator.