GDPval Leaderboard - Model Performance
This leaderboard builds on OpenAI's official GDPval framework and consolidates evaluation data from multiple third-party sources into a single ranking, helping you identify AI models capable of expert-grade results on professional tasks.
Overall Performance Leaderboard
| Rank | Company | Model | Score | ELO | Release Date | Key Tags |
|---|---|---|---|---|---|---|
| 1 | OpenAI | GPT-5.2 (xhigh) | - | 1474 (-46 / +58) | Dec. 2025 | Flagship, Accuracy, Low error, Domain-specific |
| 2 | Anthropic | Claude Opus 4.5 (Reasoning) | - | 1410 (-45 / +45) | Nov. 2025 | Reasoning, Aesthetics |
| 3 | OpenAI | GPT-5 (high) | - | 1303 (-44 / +46) | Aug. 2025 | Accuracy, Low error, Text-only, Domain-specific |
| 4 | Anthropic | Claude 4.5 Sonnet (Reasoning) | - | 1290 (-44 / +43) | Sep. 2025 | Reasoning |
| 5 | OpenAI | GPT-5.1 (high) | - | 1241 (-43 / +45) | Nov. 2025 | High tier |
| 6 | DeepSeek | DeepSeek V3.2 (Reasoning) | - | 1208 (-43 / +47) | Dec. 2025 | Reasoning |
| 7 | Google | Gemini 3 Pro Preview (high) | - | 1206 (-43 / +43) | Nov. 2025 | Multimodal |
Note: Score is temporarily shown as “–” until more multi-source data is aggregated. If a model has multiple variants, only its best-performing variant is listed. ELO values are shown with their confidence intervals.
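The page does not document how the ELO values and their confidence intervals are produced. Below is a minimal, illustrative sketch assuming a standard Elo update over pairwise head-to-head judgments and a percentile bootstrap for the asymmetric intervals; the function names (`fit_elo`, `elo_with_ci`) and the toy `battles` data are hypothetical and not part of any published methodology.

```python
import random
from collections import defaultdict

def fit_elo(battles, rounds=30, k=4.0, base=1000.0):
    """Fit Elo-style ratings from (winner, loser) pairs by sweeping
    the battle list several times with the standard Elo update rule."""
    ratings = defaultdict(lambda: base)
    for _ in range(rounds):
        for winner, loser in battles:
            # Expected score of the winner given the current rating gap.
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            ratings[winner] += k * (1.0 - expected)
            ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

def elo_with_ci(battles, n_boot=200, alpha=0.05, seed=0):
    """Return per-model (rating, -lower_offset, +upper_offset) tuples,
    where the offsets come from a percentile bootstrap over battles."""
    rng = random.Random(seed)
    point = fit_elo(battles)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = [rng.choice(battles) for _ in battles]
        for model, r in fit_elo(resampled).items():
            samples[model].append(r)
    out = {}
    for model, r in point.items():
        s = sorted(samples[model])
        lo = s[int(alpha / 2 * len(s))]
        hi = s[int((1 - alpha / 2) * len(s)) - 1]
        out[model] = (round(r), round(r - lo), round(hi - r))
    return out

# Toy usage: each tuple is one head-to-head task judgment (winner, loser).
battles = [("Model A", "Model B")] * 12 + [("Model B", "Model A")] * 7
print(elo_with_ci(battles))  # prints per-model (rating, -CI, +CI) tuples
```

Because the bootstrap takes percentiles of the resampled ratings rather than assuming symmetry, the lower and upper offsets can differ, which matches the asymmetric entries in the table such as 1474 (-46 / +58).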
Deep Dive Into the Data Sources
This leaderboard is built on the following independent evaluations. Explore each source to see its methodology and full results.
Zoom In On A Single Model
Want to understand how a specific model performs across tasks and data sources? Explore our in‑depth model profile pages.