Tool · API pricing

IBM Granite Code 8B.

Pick a model, estimate your monthly token volume, see the bill across every provider that hosts it. Live pricing, refreshed daily.

No priced API access rows on file for IBM Granite Code 8B yet.

Self-host alternative

Rent the GPU instead of paying per token.

For an open-weights model like IBM Granite Code 8B, you can rent a GPU and serve inference yourself. The math: monthly cost = (cheapest GPU hourly rental × 730 hours/month) + (your electricity rate × power draw × 730 hours/month).

GPU rental: $122.2 (1× Nvidia RTX 4000 Ada SFF @ $0.17/hr)
Electricity: (no value listed)
Self-host total: $122.2 per month
Quantisation: FP16 (picked for lowest VRAM)

This estimate assumes the GPU runs 24/7 at ~85% utilisation. If your traffic is bursty, a per-token API will likely cost less, while the GPU costs more per token served because idle hours still incur rental. The breakeven analysis lives on the Self-host vs API breakeven tool.
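The self-host math above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the function name, the 70 W power draw, and the $0.15/kWh electricity rate are assumptions chosen for the example, and many GPU rentals already include power in the hourly price.

```python
HOURS_PER_MONTH = 730  # hours/month figure used in the text above


def self_host_monthly_cost(gpu_hourly_usd, gpu_count=1,
                           power_kw=0.0, electricity_usd_per_kwh=0.0):
    """Monthly cost = rental x hours + electricity rate x power draw x hours."""
    rental = gpu_hourly_usd * gpu_count * HOURS_PER_MONTH
    electricity = electricity_usd_per_kwh * power_kw * gpu_count * HOURS_PER_MONTH
    return rental + electricity


# Rental only, at the displayed $0.17/hr rate.
rental_only = self_host_monthly_cost(0.17)
print(f"rental only: ${rental_only:.2f}")  # rental only: $124.10

# Same GPU with an assumed 70 W draw at an assumed $0.15/kWh.
with_power = self_host_monthly_cost(0.17, power_kw=0.07,
                                    electricity_usd_per_kwh=0.15)
print(f"with electricity: ${with_power:.2f}")
```

At exactly $0.17/hr the rental comes to $124.10, slightly above the $122.2 shown, which suggests the displayed hourly rate is rounded.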

How it works

Three steps to your monthly estimate.

  1. Pick the model.

    Use the search box to find an AI model: Claude, GPT, Llama, DeepSeek, Qwen, anything we track. The picker lists every model where at least one provider publishes per-token pricing.

  2. Estimate volume.

    Slide the monthly input + output token counts to match your expected workload. A typical chat app handles 1-10M input tokens per active user per month; an agent that re-reads context every turn can hit 100M+.

  3. Read the spread.

    The chart + table list every provider that hosts the model, sorted cheapest-first. Click a provider name to open its detail page, which shows pricing history, throughput benchmarks, and the affiliate signup link.

FAQ

Frequently asked.

How is the monthly bill calculated?

Total = (input rate × input tokens / 1M) + (output rate × output tokens / 1M), with both rates in $ per 1M tokens. Per-token prices come from each provider's official pricing page or /v1/models API, and the bill is recomputed on every page load, with no caching beyond a brief edge TTL.
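As a sketch, the formula above translates directly into code. The function name and the example rates are hypothetical and do not reflect any real provider's prices.

```python
TOKENS_PER_MILLION = 1_000_000


def monthly_bill(input_rate, output_rate, input_tokens, output_tokens):
    """Rates are in $ per 1M tokens; token counts are monthly totals."""
    input_cost = input_rate * input_tokens / TOKENS_PER_MILLION
    output_cost = output_rate * output_tokens / TOKENS_PER_MILLION
    return input_cost + output_cost


# Hypothetical rates: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
# Workload: 100M input tokens and 20M output tokens per month.
bill = monthly_bill(0.50, 1.50, 100_000_000, 20_000_000)
print(f"${bill:.2f}")  # $80.00
```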

Where does the pricing data come from?

Direct from each inference provider — Anthropic, OpenAI, OpenRouter, Together AI, Fireworks AI, DeepInfra, z.ai, Groq, and a dozen others. The daily refresh job (RefreshAiModelCatalogJob) re-pulls each provider's /v1/models endpoint and updates our AiModelAccess rows.

Should I always pick the cheapest provider?

Cheapest by $/M tokens isn't always cheapest by total cost. Watch for: (1) caching discounts that aggregators like OpenRouter don't pass through fully, (2) rate-limit ceilings on the smaller hosts that force you onto a more expensive tier under load, (3) per-request latency overhead from aggregators (extra ~50ms). For low-volume or bursty workloads, the absolute cheapest is usually right. For sustained production traffic, factor in throughput + reliability.

What's the difference between OpenRouter and the model maker's direct API?

OpenRouter is an aggregator: it routes your request to one of several upstream providers and adds a small markup (typically 5-20%). The model maker's direct API (e.g. api.anthropic.com for Claude) gives you the bare price + access to native features like Anthropic's prompt caching or Google's context caching. Direct is cheaper at scale; OpenRouter wins when you want one key to access dozens of models.

Are input + output prices the same?

No — output tokens are typically 3-5× more expensive than input. The 'cheapest input' provider isn't always 'cheapest output'. The 'monthly bill' column accounts for both; sort by that column for the real total.
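A small example shows how the ranking can flip. The provider names, rates, and workload below are invented for illustration only; the point is that the host with the cheapest input rate is not necessarily the one with the lowest monthly bill.

```python
def monthly_bill(input_rate, output_rate, input_tokens, output_tokens):
    """Rates are in $ per 1M tokens."""
    return (input_rate * input_tokens + output_rate * output_tokens) / 1_000_000


# Hypothetical providers: (input $/1M tokens, output $/1M tokens).
providers = {
    "host_a": (0.20, 1.00),  # cheapest input
    "host_b": (0.30, 0.60),  # cheaper output
}

input_tokens, output_tokens = 50_000_000, 30_000_000

# Provider with the lowest input rate.
cheapest_input = min(providers, key=lambda name: providers[name][0])

# Providers ranked by total monthly bill, cheapest first.
ranked = sorted(
    ((name, monthly_bill(rates[0], rates[1], input_tokens, output_tokens))
     for name, rates in providers.items()),
    key=lambda pair: pair[1],
)

print("cheapest input rate:", cheapest_input)
for name, bill in ranked:
    print(f"{name}: ${bill:.2f}")
```

Here host_a has the lowest input rate, but host_b comes out cheaper overall because the workload is output-heavy enough for its lower output rate to dominate.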

How does prompt caching affect this estimate?

Cached input tokens are usually 50-90% cheaper than fresh input. We don't currently model caching because it depends on your workload pattern (long system prompts re-used across requests benefit; chat with fresh context each turn doesn't). For high-volume single-prompt workloads, halve the input cost when comparing Anthropic/Google/OpenAI direct.
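To adjust an estimate by hand, one rough approach is to blend the cached and fresh input rates. The sketch below is an assumption-laden simplification, not something the tool computes: the function name is hypothetical, the cached fraction and discount are inputs you would have to measure or look up, and it ignores any cache-write surcharge a provider may apply.

```python
def effective_input_rate(input_rate, cached_fraction, cache_discount):
    """Blended $ per 1M input tokens.

    cached_fraction: share of input tokens served from cache (0 to 1).
    cache_discount: price reduction on cached tokens (0.5 means 50% cheaper).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    return input_rate * (1.0 - cached_fraction * cache_discount)


# Hypothetical: $3.00 per 1M input tokens, 80% of input cached at a 90% discount.
rate = effective_input_rate(3.00, cached_fraction=0.8, cache_discount=0.9)
print(f"${rate:.2f} per 1M input tokens")  # $0.84 per 1M input tokens
```

With every input token cached at a 50% discount, the function returns half the list rate, which matches the "halve the input cost" rule of thumb above.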

How often is this pricing refreshed?

Daily, via a scheduled background job at 4:15am UTC. Live prices show up within ~24 hours of a provider changing them on their pricing page. For breaking price drops (e.g. DeepSeek's R1 launch) we'll re-run manually.

What about fine-tuned variants?

Fine-tuned model deployments are priced separately from the base model and often have different rate structures (hourly compute + per-token blended). This tool covers the base-model token pricing only. For fine-tuned costs, check the provider's per-deployment pricing page directly.
