GDPval: Measuring AI Performance on Real-World Economically Valuable Tasks

GDPval is an evaluation framework that measures model performance on economically valuable, real-world tasks drawn from 44 occupations across the 9 industries contributing most to U.S. GDP, providing a concrete assessment of how AI models perform on actual knowledge work.

Coverage

44 occupations
Across 9 industries contributing most to U.S. GDP.

Tasks

1,320 tasks
220 tasks in the open-sourced gold set, crafted by experienced professionals.

Performance

100x faster
Frontier models complete tasks roughly 100x faster and cheaper than experts.

Why GDPval

GDPval tasks are based on real work products from experienced professionals (averaging 14 years of experience), grounding the benchmark in economic relevance that synthetic benchmarks lack and bridging the gap between academic tests and real-world economic value.

What Makes GDPval Different

Unlike benchmarks built on simple text prompts, GDPval tasks include reference files and context, with deliverables spanning documents, slides, diagrams, spreadsheets, and multimedia. This makes GDPval a more realistic test of how models might support professionals in their daily work.

What is GDPval?

GDPval is an evaluation framework designed to track how well AI models perform on economically valuable, real-world tasks, a step toward measuring increasingly realistic and economically relevant AI capabilities.

GDPval overview

GDPval measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across 44 occupations and 9 industries. The evaluation spans 1,320 specialized tasks, each based on a real work product, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.

The dataset's 1,320 tasks (220 of them in the open-sourced gold set) were each crafted and vetted by professionals with over 14 years of experience on average. Tasks come with reference files and context, and deliverables span documents, slides, diagrams, spreadsheets, and multimedia, making GDPval a realistic test of how models might support professionals.

Early results show that today's best frontier models are already approaching the quality of work produced by industry experts on GDPval tasks, and can complete them roughly 100x faster and 100x cheaper.

GDPval Reports

Explore GDPval results and model performance across industries, occupations, and time.

By Industry & Occupation

Performance across 9 industries and 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers.

Model Performance Over Time

Track the improvement of frontier models on GDPval from GPT-4o to GPT-5, a clear linear trend in which performance more than doubled in a year.

Win Rate vs. Experts

See how leading models (GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and others) compare to industry experts on deliverable quality, and which models excel on different aspects of knowledge work.

Speed & Cost Comparison

Compare how frontier models complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts.

Impact of Reasoning & Context

See how increased reasoning effort, increased task context, and increased scaffolding improve model performance.

Export & Sharing

Download charts and CSVs to brief stakeholders, compare models, and plan production rollouts.

How GDPval Works

Understand the GDPval evaluation methodology and how deliverables are graded.

How GDPval Grades Performance

  • Expert graders blindly compare model-generated deliverables with those produced by task writers
  • Graders rank deliverables and classify each as "better than", "as good as", or "worse than" human work
  • Task writers create detailed scoring rubrics for consistency and transparency
  • An automated grader is available as an experimental research service at evals.openai.com
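The pairwise grades described above can be aggregated into a single win-rate number. The sketch below is an illustration of that aggregation, not the official GDPval scoring code; the label names and the convention of counting ties as half a win are assumptions.

```python
# Minimal sketch: turning pairwise blind grades into a win rate.
# Labels and the "ties count as half a win" convention are assumptions
# for illustration, not the official GDPval scoring implementation.

def win_rate(grades):
    """Fraction of comparisons where the model deliverable was rated at
    least as good as the human one, counting ties as half a win."""
    score = {"better": 1.0, "as_good": 0.5, "worse": 0.0}
    return sum(score[g] for g in grades) / len(grades)

# Hypothetical grades from three blind comparisons:
print(win_rate(["better", "as_good", "worse"]))  # 0.5
```

Under this convention a win rate of 0.5 means the model's deliverables were, on balance, rated as good as the experts'.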

GDPval Dataset Construction

GDPval Task Creation

For each occupation, we worked with experienced professionals (averaging 14 years of experience) to create representative tasks reflecting their day-to-day work, so the dataset captures real knowledge work across diverse sectors.

GDPval Review Process

  • Each task received an average of 5 rounds of expert review
  • Tasks were checked by other task writers, additional occupational reviewers, and model-based validation
  • 30 fully reviewed tasks per occupation in the full set
  • 5 tasks per occupation in the open-sourced gold set
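The dataset sizes quoted elsewhere follow directly from these per-occupation counts:

```python
# The full-set and gold-set sizes follow from the per-occupation counts.
occupations = 44
full_set = occupations * 30   # 30 fully reviewed tasks per occupation
gold_set = occupations * 5    # 5 open-sourced gold tasks per occupation
print(full_set, gold_set)     # 1320 220
```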

Why Choose GDPval

GDPval offers ground-truth tasks, economic relevance, and actionable insights, standing out for its realism and comprehensive coverage of knowledge work.

Real Work Products

Tasks are based on actual deliverables like legal briefs, engineering blueprints, customer support conversations, and nursing care plans.

Expert Evaluation

Expert graders from the same occupations blindly compare model outputs with human work, ranking each as better than, as good as, or worse than the human deliverable.

Linear Progress

Performance more than doubled from GPT-4o (spring 2024) to GPT-5 (summer 2025), following a clear linear trend.

Speed & Cost Advantage

Frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts.

Early GDPval Results

Key findings from the GDPval evaluation.

Today's best frontier models are already approaching the quality of work produced by industry experts on GDPval tasks. Claude Opus 4.1 was the best performing model overall, excelling in particular on aesthetics (e.g., document formatting, slide layout), while GPT-5 excelled in particular on accuracy (e.g., finding domain-specific knowledge).

GDPval Model Performance
Across 220 GDPval tasks in the gold set

GDPval performance has more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025), following a clear linear trend. Results also show that increasing model size, encouraging more reasoning steps, and giving richer task context each led to measurable gains.
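A trend like this can be checked by fitting a least-squares line to scores over time. The sketch below shows the mechanics; the release dates and win rates in it are hypothetical placeholders, not the published GDPval numbers.

```python
# Fit a simple least-squares line to benchmark scores over time.
# Data points are hypothetical placeholders, not published GDPval results.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

years  = [2024.25, 2024.75, 2025.5]   # hypothetical release dates
scores = [0.20, 0.30, 0.45]           # hypothetical win rates vs. experts

slope, intercept = fit_line(years, scores)
print(round(slope, 3))  # 0.2  (win-rate gain per year on this toy data)
```

On the toy data, a slope of 0.2 per year would more than double a 0.2 starting score within about a year, which is the shape of trend the report describes.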

GDPval Linear Improvement
GDPval performance more than doubled in a year

Frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts. However, these figures reflect pure model inference time and API billing rates, and do not capture the human oversight, iteration, and integration steps required in real workplace settings.
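The "100x" figures are ratios of expert time and cost to model inference time and cost. A minimal sketch of that arithmetic, using hypothetical placeholder numbers rather than measured GDPval data:

```python
# Speed/cost multiples are simple ratios of expert to model figures.
# All numbers here are hypothetical placeholders, not GDPval measurements.
def multiple(expert, model):
    return expert / model

expert_minutes, model_minutes = 400, 4      # hypothetical: ~6.7 h vs 4 min
expert_cost, model_cost = 400.0, 4.0        # hypothetical: expert fee vs API bill

print(multiple(expert_minutes, model_minutes))  # 100.0
print(multiple(expert_cost, model_cost))        # 100.0
```

Note that the model-side figures cover inference only; review and integration time would shrink the real-world multiple.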

GDPval Speed & Cost
100x faster and cheaper on GDPval tasks

We're releasing a gold subset of 220 tasks and a public automated grading service at evals.openai.com to facilitate future research into real-world model capabilities. The gold subset lets researchers build on this work and compare model performance consistently.

GDPval Open Source
220 GDPval gold tasks available for research

Frequently Asked Questions

Key details about GDPval

What does GDPval cover?

GDPval spans 44 occupations selected from the 9 industries contributing most to U.S. GDP. The full set includes 1,320 specialized tasks (220 in the open-sourced gold set), each crafted by professionals with over 14 years of experience on average.

What is being open-sourced?

The gold subset includes 220 tasks (5 per occupation) that are publicly available to researchers. We're also releasing a public automated grading service at evals.openai.com as an experimental research service to facilitate future work.

How does GDPval measure performance?

Expert graders blindly compare model-generated deliverables with those produced by task writers, ranking each as "better than", "as good as", or "worse than" human work. GDPval also tracks speed and cost, finding that frontier models complete tasks roughly 100x faster and 100x cheaper than experts.

What are GDPval's limitations?

GDPval is an early step in measuring AI performance on economically valuable tasks. The current version is one-shot, so it doesn't capture cases where a model would need to build context or improve through multiple drafts. Future versions will expand to more occupations, industries, and task types, with increased interactivity.

How can I access GDPval?

You can read the full results in our paper, access the gold subset of 220 tasks, and use the public automated grading service at evals.openai.com to build on this work.

Explore GDPval

See how frontier models perform on GDPval tasks, compare their results to those of industry experts, and explore the potential for AI to support professionals in their daily work.

View GDPval Reports