GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and Llama 3.3 — scored across reasoning, coding, math, and knowledge benchmarks.
Sources: official model cards, HELM, Papers With Code · Scores as of March 2025