o3, Claude 3.7 Sonnet (extended thinking), and Gemini 2.0 Flash Thinking: how the new class of reasoning models compares on the hardest tests.
Sources: OpenAI, Anthropic, Google model cards · AIME 2024, GPQA Diamond, SWE-bench Verified, MMMU · March 2025