by Meta AI

Llama 3.2 11B Vision.

vision open weights laptop+ 11B params 128K ctx Transformer (vision-adapted) Quality 73.0
Cheapest input
$0.245/M
on OpenRouter
Cheapest output
$0.245/M
on OpenRouter
Fastest
41 tok/s
on OpenRouter
Smallest GPU
1× AMD Radeon RX 5700 XT
Capability snapshot

What it's best at.

General knowledge 73.0
Multimodal 50.7

Scores normalised against benchmark ceilings (100 = perfect). Coloured by tier — coral 80+ frontier, lavender 65+ strong, sage 50+ solid, slate below.

Benchmarks

Published scores.

Benchmark Score Source
MMLU 73.0 official ↗
MMMU 50.7 official ↗
Description

About Llama 3.2 11B Vision.

Llama 3.2 11B Vision is Meta's first open-weight multimodal LLM in the 11B range. Vision encoder bolted onto the Llama text backbone via cross-attention adapters (not native multimodal training). Strong on chart understanding and document OCR; weaker than GPT-4o or Gemini on photorealistic image reasoning. Runs on a single 24GB GPU (RTX 4090 / RTX 5090) at INT4 — much more accessible than the 90B sibling for hobbyist multimodal work.

Architecture

How it's built.

Architecture
Transformer (vision-adapted)
Trained on
6.0T tokens
561 tokens per parameter — well above the Chinchilla optimum.
Knowledge cutoff
Dec 2023
299 days from cutoff to release.
Context window

How much it can remember.

128K tokens ≈ 96,000 English words
4K 32K 128K 1M
Max output per call: 4K tokens
Capabilities

What it can do.

Vision input
· Audio input
· Video input
· Function calling
· Tool use
· JSON mode
Streaming
Fine-tuning