
Reasoning Model Benchmarks 2025

o3, Claude 3.7 Sonnet (extended thinking), and Gemini 2.0 Flash Thinking: how the new class of reasoning models compares on the hardest tests.

By Benchmark

Score Matrix

Sources: OpenAI, Anthropic, and Google model cards · AIME 2024, GPQA Diamond, SWE-bench Verified, MMMU · March 2025
