
LLM Benchmark Comparison 2025

GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and Llama 3.3 — scored across reasoning, coding, math, and knowledge benchmarks.


Rank	Model	Overall	MMLU	HumanEval	MATH	GPQA	Context

Sources: official model cards, HELM, Papers With Code · Scores as of March 2025
