IFEval — does it actually do what you ask?
Best AI models for instruction-following.
IFEval scores whether a model obeys explicit, checkable constraints: word counts, JSON formats, specific phrasings. This is the score that translates to production agent reliability.
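To make the idea concrete, here is a minimal sketch of the kind of programmatic checks such constraints allow. It is not the official IFEval harness; the function names, the sample response, and the pass-rate helper are all assumptions made up for illustration.

```python
import json


def within_word_limit(response: str, max_words: int) -> bool:
    """True if the response has at most max_words whitespace-separated words."""
    return len(response.split()) <= max_words


def is_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def contains_phrase(response: str, phrase: str) -> bool:
    """True if the required phrase appears, ignoring case."""
    return phrase.lower() in response.lower()


def constraint_pass_rate(results: list[bool]) -> float:
    """Share of constraints satisfied, on a 0-100 scale (hypothetical helper)."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Hypothetical model output checked against three constraints.
response = '{"summary": "Order shipped on time."}'
results = [
    within_word_limit(response, max_words=10),
    is_valid_json(response),
    contains_phrase(response, "shipped"),
]
print(results, constraint_pass_rate(results))
```

Each check returns a plain yes or no, which is why this style of benchmark can be scored without a human or a judge model.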
Benchmarks used: IFEval
Showing top 2 models with published data on at least one of the benchmarks above. Scores are weighted averages on a 0–100 scale.
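As a rough illustration of how a weighted average on a 0–100 scale could be computed, the sketch below averages only over the benchmarks a model has published scores for. The weights, the second benchmark name, and the scores are hypothetical; the page does not state the actual weighting.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 scores over the benchmarks present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scored benchmarks must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical weights; HYPOTHETICAL_B is a placeholder, not a real benchmark.
weights = {"IFEVAL": 3.0, "HYPOTHETICAL_B": 1.0}

# A model with only an IFEval score keeps that score unchanged.
print(weighted_average({"IFEVAL": 80.0}, weights))

# A model with both scores is pulled toward the more heavily weighted one.
print(weighted_average({"IFEVAL": 80.0, "HYPOTHETICAL_B": 60.0}, weights))
```

With a single benchmark in use, as here, the weighted average is simply the IFEval score itself.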