OpenAI Structured Outputs Evaluation

Structured Outputs Evaluation

chatgptopenaigpt-4use-casesevaluationexamplesopenai-apiopenai-cookbook

alph-notebooks/openai-cookbook / structured-outputs-evaluation.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Structured Output Evaluation Cookbook

This notebook walks you through a set of focused, runnable examples how to use the OpenAI Evals framework to test, grade, and iterate on tasks that require large‑language models to produce structured outputs.

Why does this matter?
Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can codify expectations as automated evals and let your team ship with safety bricks instead of sand.

Quick Tour

Section 1 – Prerequisites: environment variables and package setup
Section 2 – Walk‑through: Code‑symbol extraction: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it.
Section 3 – Additional Recipes: sketches of common production patterns such as sentiment extraction as additional code sample for evaluation.
Section 4 – Result Exploration: lightweight helpers for pulling run output and digging into failures.

Prerequisites

Install dependencies (minimum versions shown):

pip install --upgrade openai

Authenticate by exporting your key:

export OPENAI_API_KEY="sk‑..."

Optional: if you plan to run evals in bulk, set up an organization‑level key with appropriate limits.

Use Case 1: Code symbol extraction

The goal is to extract all function, class, and constant symbols from python files inside the OpenAI SDK.
For each file we ask the model to emit structured JSON like:

{
  "symbols": [
    {"name": "OpenAI", "kind": "class"},
    {"name": "Evals", "kind": "module"},
    ...
  ]
}

A rubric model then grades completeness (did we capture every symbol?) and quality (are the kinds correct?) on a 1‑7 scale.

Evaluating Code Quality Extraction with a Custom Dataset

Let us walk though an example to evaluate a model's ability to extract symbols from code using the OpenAI Evals framework with a custom in-memory dataset.

Initialize SDK client

Creates an openai.OpenAI client using the OPENAI_API_KEY we exported above. Nothing will run without this.

[11]


[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

Dataset factory & grading rubric

get_dataset builds a small in-memory dataset by reading several SDK files.
structured_output_grader defines a detailed evaluation rubric.
client.evals.create(...) registers the eval with the platform.

[4]

Kick off model runs

Here we launch two runs against the same eval: one that calls the Completions endpoint, and one that calls the Responses endpoint.

[5]

Utility poller

Next, we will use a simple loop that waits for all runs to finish, then saves each run’s JSON to disk so you can inspect it later or attach it to CI artifacts.

[7]

Load outputs for quick inspection

We will fetch the output items for both runs so we can print or post‑process them.

[8]

Human-readable dump

Let us print a side-by-side view of completions vs responses.

[20]

Visualize the Results

Below are visualizations that represent the evaluation data and code outputs for structured QA evaluation. These images provide insights into the data distribution and the evaluation workflow.

Evaluation Data Overview

Evaluation Data Part 1

Evaluation Data Part 2

Evaluation Code Workflow

Evaluation Code Structure

By reviewing these visualizations, you can better understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.

Use Case 2: Multi-lingual Sentiment Extraction

In a similar way, let us evaluate a multi-lingual sentiment extraction model with structured outputs.

[29]

[31]

[32]

[ ]

Visualize evals data

Summary and Next Steps

In this notebook, we have demonstrated how to use the OpenAI Evaluation API to evaluate a model's performance on a structured output task.

Next steps:

We encourage you to try out the API with your own models and datasets.
You can also explore the API documentation for more details on how to use the API.

For more information, see the OpenAI Evals documentation.