by OpenAI

GPT-4o.

multimodal closed 128K ctx Transformer (multimodal) Quality 77.6 Elo 1318
Cheapest input
$2.5/M
on OpenRouter
Cheapest output
$10.0/M
on OpenRouter
Fastest
57 tok/s
on OpenRouter
Hosted equiv.
~$3.6/hr
@ 100 tok/s on OpenRouter
Capability snapshot

What it's best at.

Coding 90.2
General knowledge 88.7
Math 76.6
Multimodal 69.1

Scores normalised against benchmark ceilings (100 = perfect). Coloured by tier — coral 80+ frontier, lavender 65+ strong, sage 50+ solid, slate below.

Benchmarks

Published scores.

Benchmark Score Source
GPQA 53.6 official ↗
MATH 76.6 official ↗
MMLU 88.7 official ↗
MMMU 69.1 official ↗
HumanEval 90.2 official ↗
Leaderboard standing

Independent rankings.

LMSYS Chatbot Arena
1318
Elo from blind head-to-head votes
View leaderboard ↗
Artificial Analysis Quality Index
71.0
Composite of reasoning + coding + tool-use benchmarks
View on Artificial Analysis ↗
Description

About GPT-4o.

GPT-4o (omni) is OpenAI's first model trained natively on text, vision, and audio in a single pass, released May 2024. It enables real-time voice conversations with sub-300ms latency and native image understanding without a separate vision encoder. Largely superseded by GPT-5 for new builds, but still widely deployed in production because the API surface and prompt patterns are well-understood. Cheaper than GPT-5 input-side; comparable on output. 128K context. Most often used today for voice-mode applications where its native audio support beats GPT-5's API-only voice.

Architecture

How it's built.

Architecture
Transformer (multimodal)
Knowledge cutoff
Oct 2023
225 days from cutoff to release.
Context window

How much it can remember.

128K tokens ≈ 96,000 English words
4K 32K 128K 1M
Max output per call: 16K tokens
Capabilities

What it can do.

Vision input
Audio input
· Video input
Function calling
Tool use
JSON mode
Streaming
Fine-tuning