IFEval: does the model actually do what you ask?

Best AI models for instruction-following.

IFEval scores whether a model obeys explicit, verifiable constraints in a prompt: word counts, JSON formats, required phrasings. It is the score that translates to production agent reliability.
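To make "verifiable constraint" concrete, here is a minimal sketch of the kind of programmatic check this style of benchmark relies on. The function names, the sample response, and the scoring helper are all hypothetical illustrations, not the official IFEval harness.

```python
import json


def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_contains_phrase(response: str, phrase: str) -> bool:
    """True if the required phrase appears (case-insensitive)."""
    return phrase.lower() in response.lower()


def instruction_accuracy(results: list[bool]) -> float:
    """Share of checks passed, on a 0-100 scale (hypothetical scoring helper)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical model response to a prompt with three constraints.
response = '{"answer": "Paris is the capital of France."}'
results = [
    check_max_words(response, 10),
    check_valid_json(response),
    check_contains_phrase(response, "capital of France"),
]
score = instruction_accuracy(results)
```

Each check is a plain pass/fail test on the output text, so no human or model judge is needed to grade a response.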

Benchmarks used: IFEval

#	Model	Score	Developer
1	Llama 3.3 70B (open)	92.1	Meta AI
2	Llama 3.1 405B (open)	88.6	Meta AI

Showing top 2 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
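The page does not state the weights behind the average. As a rough sketch, assuming a hypothetical weight table and a hypothetical `weighted_average` helper, a weighted average over a single benchmark reduces to that benchmark's score:

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over the benchmarks that have both a score and a weight."""
    used = [name for name in scores if name in weights]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        raise ValueError("no weighted benchmarks with published scores")
    return sum(scores[name] * weights[name] for name in used) / total_weight


# Hypothetical weight table: this leaderboard lists only IFEval.
weights = {"IFEval": 1.0}
llama_33 = weighted_average({"IFEval": 92.1}, weights)
llama_31 = weighted_average({"IFEval": 88.6}, weights)
```

Under that assumption, the table's Score column is simply each model's IFEval result.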
