Question 1

How is the monthly bill calculated?

Accepted Answer

Total = (input rate × input tokens / 1M) + (output rate × output tokens / 1M), where both rates are quoted in dollars per 1M tokens. We pull per-token prices from each provider's official pricing page or /v1/models API and recompute the total on every page load, with no caching beyond a brief edge TTL.
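As a minimal sketch, the formula above can be written out as follows. The function name and the example rates are hypothetical, chosen only to illustrate the arithmetic; they are not any provider's actual prices.

```python
def monthly_bill(input_rate_per_m: float, output_rate_per_m: float,
                 input_tokens: int, output_tokens: int) -> float:
    """Return the estimated monthly bill in dollars.

    Rates are in dollars per 1M tokens; token counts are monthly totals.
    """
    input_cost = input_rate_per_m * input_tokens / 1_000_000
    output_cost = output_rate_per_m * output_tokens / 1_000_000
    return input_cost + output_cost


# Hypothetical rates: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
bill = monthly_bill(3.00, 15.00, input_tokens=200_000_000, output_tokens=50_000_000)
print(f"${bill:,.2f}")  # prints $1,350.00
```

Here the input side contributes $600 and the output side $750, even though output volume is a quarter of input volume.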

Question 2

Where does the pricing data come from?

Accepted Answer

Direct from each inference provider — Anthropic, OpenAI, OpenRouter, Together AI, Fireworks AI, DeepInfra, z.ai, Groq, and a dozen others. The daily refresh job (RefreshAiModelCatalogJob) re-pulls each provider's /v1/models endpoint and updates our AiModelAccess rows.

Question 3

Should I always pick the cheapest provider?

Accepted Answer

Cheapest by $/M tokens isn't always cheapest by total cost. Watch for: (1) caching discounts that aggregators like OpenRouter don't pass through fully, (2) rate-limit ceilings on the smaller hosts that force you onto a more expensive tier under load, (3) per-request latency overhead from aggregators (roughly 50 ms extra). For low-volume or bursty workloads, the absolute cheapest is usually right. For sustained production traffic, factor in throughput and reliability as well.

Question 4

What's the difference between OpenRouter and the model maker's direct API?

Accepted Answer

OpenRouter is an aggregator: it routes your request to one of several upstream providers and adds a small markup (typically 5-20%). The model maker's direct API (e.g. api.anthropic.com for Claude) gives you the base price plus access to native features like Anthropic's prompt caching or Google's context caching. Direct is cheaper at scale; OpenRouter wins when you want one key to access dozens of models.
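To see what that markup range means in practice, here is a rough sketch that applies the 5-20% figure quoted above to a direct rate. The function name and the $3.00 example rate are hypothetical.

```python
def aggregator_rate(direct_rate_per_m: float, markup: float) -> float:
    """Return the marked-up rate in dollars per 1M tokens.

    `markup` is a fraction, e.g. 0.05 for a 5% markup.
    """
    return direct_rate_per_m * (1 + markup)


# Hypothetical direct rate of $3.00 per 1M tokens, with the 5-20% range from the answer.
direct = 3.00
low = aggregator_rate(direct, 0.05)
high = aggregator_rate(direct, 0.20)
print(f"direct ${direct:.2f}, via aggregator ${low:.2f} to ${high:.2f} per 1M tokens")
```

At low volume the difference is a few cents; at billions of tokens per month the same percentage becomes a meaningful line item.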

Question 5

Are input and output prices the same?

Accepted Answer

No. Output tokens are typically 3-5× more expensive than input, and the provider with the cheapest input price isn't always the one with the cheapest output price. The 'monthly bill' column accounts for both; sort by that column for the real total.
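A small sketch of why sorting by the combined bill matters. The two providers and their rates are made up for illustration, and the `Quote` class is a hypothetical structure, not the tool's data model.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    provider: str
    input_rate: float   # dollars per 1M input tokens
    output_rate: float  # dollars per 1M output tokens


def bill(q: Quote, input_tokens: int, output_tokens: int) -> float:
    """Combined monthly cost in dollars for one provider quote."""
    return (q.input_rate * input_tokens + q.output_rate * output_tokens) / 1_000_000


# Hypothetical quotes: host-a has cheaper input, host-b has cheaper output.
quotes = [Quote("host-a", 0.50, 2.50), Quote("host-b", 0.60, 1.80)]

cheapest_input = min(quotes, key=lambda q: q.input_rate)
cheapest_total = min(quotes, key=lambda q: bill(q, 100_000_000, 40_000_000))

print(cheapest_input.provider)  # host-a
print(cheapest_total.provider)  # host-b
```

With 100M input and 40M output tokens, host-a comes to about $150 and host-b to about $132, so the provider with the lower input price ends up more expensive overall.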

Question 6

How does prompt caching affect this estimate?

Accepted Answer

Cached input tokens are usually 50-90% cheaper than fresh input. We don't currently model caching because it depends on your workload pattern: long system prompts reused across requests benefit, while chat with fresh context each turn doesn't. As a rough rule of thumb for high-volume workloads that reuse a single long prompt, halve the input cost when comparing the direct Anthropic, Google, and OpenAI APIs.
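If you want something finer than the halving rule, one way to approximate it is to blend the cached and fresh rates by the share of input tokens you expect to hit the cache. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the example numbers are illustrative, and it ignores any surcharge a provider may apply when writing to the cache.

```python
def effective_input_rate(input_rate_per_m: float,
                         cached_fraction: float,
                         cache_discount: float) -> float:
    """Blend cached and fresh input rates (dollars per 1M tokens).

    cached_fraction: share of input tokens served from cache (0.0 to 1.0).
    cache_discount:  discount on cached tokens, e.g. 0.75 for 75% cheaper.
    """
    if not 0.0 <= cached_fraction <= 1.0 or not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cached_fraction and cache_discount must be in [0, 1]")
    cached_rate = input_rate_per_m * (1 - cache_discount)
    return cached_fraction * cached_rate + (1 - cached_fraction) * input_rate_per_m


# Hypothetical: $3.00 per 1M input, 80% of input tokens cached at a 75% discount.
rate = effective_input_rate(3.00, cached_fraction=0.8, cache_discount=0.75)
print(f"${rate:.2f} per 1M input tokens")  # prints $1.20 per 1M input tokens
```

Setting `cached_fraction` to 0 reproduces the uncached estimate this tool shows today.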

Question 7

How often is this pricing refreshed?

Accepted Answer

Daily, via a scheduled background job at 04:15 UTC. New prices show up within ~24 hours of a provider changing them on their pricing page. For breaking price drops (e.g. DeepSeek's R1 launch) we re-run the job manually.

Question 8

What about fine-tuned variants?

Accepted Answer

Fine-tuned model deployments are priced separately from the base model and often have different rate structures (for example, hourly compute plus a blended per-token rate). This tool covers base-model token pricing only. For fine-tuned costs, check the provider's per-deployment pricing page directly.
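For a rough sense of how such a rate structure adds up, the sketch below assumes the simplest version of it: a flat hourly charge for the deployment plus one blended per-token rate. Real provider pricing may differ, and every name and number here is hypothetical.

```python
def fine_tuned_monthly_cost(hourly_rate: float, hours: float,
                            blended_rate_per_m: float, tokens: int) -> float:
    """Estimate a monthly cost in dollars for a fine-tuned deployment.

    Assumes an hourly compute charge plus a single blended rate
    (dollars per 1M tokens) covering input and output together.
    """
    compute_cost = hourly_rate * hours
    token_cost = blended_rate_per_m * tokens / 1_000_000
    return compute_cost + token_cost


# Hypothetical: $2.00/hour for 720 hours, plus $0.50 per 1M tokens on 100M tokens.
cost = fine_tuned_monthly_cost(2.00, 720, 0.50, 100_000_000)
print(f"${cost:,.2f}")  # prints $1,490.00
```

In this example the hourly charge dominates, which is why per-token comparisons alone can understate the cost of a fine-tuned deployment.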

IBM Granite Code 8B.

Rent the GPU instead of paying per token.

Three steps to your monthly estimate: pick the model, estimate volume, read the spread.
