by DeepSeek

DeepSeek R1 Distill Llama 70B.

text · open weights · workstation · 70B params · 128K ctx · Transformer (distilled) · Quality 75.9
🧬 Distilled from DeepSeek R1 — smaller, cheaper to run, similar reasoning style.
Cheapest input: $0.7/M tokens, on DeepSeek Platform
Cheapest output: $0.8/M tokens, on DeepSeek Platform
Fastest: 47 tok/s, on OpenRouter
Smallest GPU: 1× Nvidia L40S, $0.28/hr
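As a rough illustration of what the listed prices mean for a workload, the sketch below multiplies token counts by the per-million rates shown above. The helper name `estimate_cost_usd` and the example token counts are hypothetical; only the two prices come from the listing.

```python
# Prices from the listing above, in USD per 1M tokens.
INPUT_USD_PER_M = 0.7
OUTPUT_USD_PER_M = 0.8


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: cost of a workload at the listed per-token prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Made-up workload: 2M prompt tokens and 500K generated tokens.
    print(f"${estimate_cost_usd(2_000_000, 500_000):.2f}")  # $1.80
```

Reasoning models tend to produce long outputs, so the output rate usually dominates the bill even though the two prices are close.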
Capability snapshot

What it's best at.

Math 94.5
Coding 83.0
Reasoning 70.0

Scores are normalised against benchmark ceilings (100 = perfect) and coloured by tier: coral for 80+ (frontier), lavender for 65+ (strong), sage for 50+ (solid), slate for anything below.
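The tier thresholds described above can be written as a small lookup. This is a minimal sketch: the function name `tier` and the label for the lowest band are assumptions, while the thresholds and the three scores are taken from this page.

```python
def tier(score: float) -> str:
    """Map a normalised score (0-100) to the tier names used on this page."""
    if score >= 80:
        return "frontier"
    if score >= 65:
        return "strong"
    if score >= 50:
        return "solid"
    return "below solid"  # the page gives this band a colour but no name


# Scores from the capability snapshot above.
SNAPSHOT = {"Math": 94.5, "Coding": 83.0, "Reasoning": 70.0}

if __name__ == "__main__":
    for area, score in SNAPSHOT.items():
        print(f"{area}: {score} -> {tier(score)}")
```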

Benchmarks

Published scores.

Benchmark    Score    Source
MATH         94.5     official
MMLU-Pro     70.0     official
HumanEval    83.0     official
Description

About DeepSeek R1 Distill Llama 70B.

DeepSeek R1 Distill Llama 70B is a distilled student model: the base architecture is Llama 3.3 70B, post-trained on chain-of-thought reasoning traces generated by R1. It inherits most of R1's math and coding capability at roughly 5% of the inference cost. It was released under the MIT license alongside the R1 paper. At FP16 the weights take about 140 GB, so the model fits on 2× H100 (80 GB each); at INT4 they take about 35 GB and fit on 1× H100. It is widely deployed as a cost-sensitive reasoning workhorse: much cheaper than full R1, and a much stronger reasoner than vanilla Llama 70B.
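The GPU sizing in the description follows from simple arithmetic on parameter count and bytes per parameter. The sketch below reproduces it under stated assumptions: a nominal 70e9 parameters, weights only (no KV cache or activations), and a hypothetical 10% of each GPU's memory reserved for runtime overhead. The function names are illustrative.

```python
import math

PARAMS = 70e9  # nominal parameter count from the listing
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(params: float, precision: str, gpu_gb: float = 80.0,
             overhead_fraction: float = 0.10) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    overhead_fraction is an assumed reserve for the runtime; the KV cache
    for long contexts needs additional memory on top of this estimate.
    """
    usable_gb = gpu_gb * (1.0 - overhead_fraction)
    return math.ceil(weight_gb(params, precision) / usable_gb)


if __name__ == "__main__":
    for precision in ("fp16", "int4"):
        print(precision, f"{weight_gb(PARAMS, precision):.0f} GB,",
              min_gpus(PARAMS, precision), "x 80 GB GPU")
    # A 48 GB card such as the L40S also holds the INT4 weights.
    print("int4 on 48 GB:", min_gpus(PARAMS, "int4", gpu_gb=48.0), "GPU")
```

With these assumptions the FP16 weights leave only about 20 GB across two 80 GB cards, so long-context serving at FP16 may need more headroom than the weight-only estimate suggests.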

Architecture

How it's built.

Architecture: Transformer (distilled)
Knowledge cutoff: Jul 2024 (203 days from cutoff to release)
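The 203-day figure can be reproduced with plain date arithmetic. Both dates below are assumptions rather than values stated on this page: "Jul 2024" is read as 1 July 2024, and the release is taken as 20 January 2025, the public release date of the R1 family.

```python
from datetime import date

CUTOFF = date(2024, 7, 1)    # assumption: "Jul 2024" read as the 1st of the month
RELEASE = date(2025, 1, 20)  # assumption: public release date of the R1 family


def days_from_cutoff_to_release(cutoff: date = CUTOFF, release: date = RELEASE) -> int:
    """Number of days between the knowledge cutoff and the release."""
    return (release - cutoff).days


if __name__ == "__main__":
    print(days_from_cutoff_to_release())  # 203
```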
Context window

How much it can remember.

128K tokens ≈ 96,000 English words
Max output per call: 33K tokens
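The word estimate above corresponds to a rule of thumb of about 0.75 English words per token. The sketch below reproduces that conversion and adds a simple budget check. The ratio, the helper names, and the idea that prompt and output share the same window are assumptions; "33K" is taken as 33,000 because the exact limit is not stated here.

```python
CONTEXT_TOKENS = 128_000     # "128K" as listed
MAX_OUTPUT_TOKENS = 33_000   # "33K" as listed; the exact limit is not given
WORDS_PER_TOKEN = 0.75       # assumed rule of thumb for English text


def approx_words(tokens: int) -> int:
    """Rough English word count for a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def fits(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """Check a request against both limits, assuming prompt and output share the window."""
    within_output_cap = requested_output_tokens <= MAX_OUTPUT_TOKENS
    within_context = prompt_tokens + requested_output_tokens <= CONTEXT_TOKENS
    return within_output_cap and within_context


if __name__ == "__main__":
    print(approx_words(CONTEXT_TOKENS))  # 96000
    print(fits(100_000, 20_000))         # True
    print(fits(100_000, 30_000))         # False: exceeds the shared window
```

Actual token counts depend on the tokenizer and the language of the text, so the word figure is only an order-of-magnitude guide.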
Capabilities

What it can do.

· Vision input
· Audio input
· Video input
· Function calling
· Tool use
· JSON mode
Streaming
Fine-tuning