Structured Outputs Evaluation
Structured Output Evaluation Cookbook
This notebook walks you through a set of focused, runnable examples how to use the OpenAI Evals framework to test, grade, and iterate on tasks that require large‑language models to produce structured outputs.
Why does this matter?
Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can codify expectations as automated evals and let your team ship with safety bricks instead of sand.
Quick Tour
- Section 1 – Prerequisites: environment variables and package setup
- Section 2 – Walk‑through: Code‑symbol extraction: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it.
- Section 3 – Additional Recipes: sketches of common production patterns such as sentiment extraction as additional code sample for evaluation.
- Section 4 – Result Exploration: lightweight helpers for pulling run output and digging into failures.
Prerequisites
- Install dependencies (minimum versions shown):
pip install --upgrade openai
- Authenticate by exporting your key:
export OPENAI_API_KEY="sk‑..."
- Optional: if you plan to run evals in bulk, set up an organization‑level key with appropriate limits.
Use Case 1: Code symbol extraction
The goal is to extract all function, class, and constant symbols from python files inside the OpenAI SDK.
For each file we ask the model to emit structured JSON like:
{
"symbols": [
{"name": "OpenAI", "kind": "class"},
{"name": "Evals", "kind": "module"},
...
]
}
A rubric model then grades completeness (did we capture every symbol?) and quality (are the kinds correct?) on a 1‑7 scale.
Evaluating Code Quality Extraction with a Custom Dataset
Let us walk though an example to evaluate a model's ability to extract symbols from code using the OpenAI Evals framework with a custom in-memory dataset.
Initialize SDK client
Creates an openai.OpenAI client using the OPENAI_API_KEY we exported above. Nothing will run without this.
[notice] A new release of pip is available: 24.0 -> 25.1.1 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
Dataset factory & grading rubric
get_datasetbuilds a small in-memory dataset by reading several SDK files.structured_output_graderdefines a detailed evaluation rubric.client.evals.create(...)registers the eval with the platform.
Kick off model runs
Here we launch two runs against the same eval: one that calls the Completions endpoint, and one that calls the Responses endpoint.
Utility poller
Next, we will use a simple loop that waits for all runs to finish, then saves each run’s JSON to disk so you can inspect it later or attach it to CI artifacts.
Load outputs for quick inspection
We will fetch the output items for both runs so we can print or post‑process them.
Human-readable dump
Let us print a side-by-side view of completions vs responses.
Visualize the Results
Below are visualizations that represent the evaluation data and code outputs for structured QA evaluation. These images provide insights into the data distribution and the evaluation workflow.
Evaluation Data Overview


Evaluation Code Workflow

By reviewing these visualizations, you can better understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.
Use Case 2: Multi-lingual Sentiment Extraction
In a similar way, let us evaluate a multi-lingual sentiment extraction model with structured outputs.
Visualize evals data

Summary and Next Steps
In this notebook, we have demonstrated how to use the OpenAI Evaluation API to evaluate a model's performance on a structured output task.
Next steps:
- We encourage you to try out the API with your own models and datasets.
- You can also explore the API documentation for more details on how to use the API.
For more information, see the OpenAI Evals documentation.