
Custom LLM-as-a-Judge Implementation

In the following notebook, we'll walk through an example of how you can leverage Custom LLM-as-a-Judge through the NeMo Evaluator Microservice.

Full documentation is available here!

In our example - we'll be looking at the following scenario:

We have a JSONL file with medical consultation information (synthetically generated). We will use a build.nvidia.com endpoint model to generate summaries of those consultations - and then use OpenAI to judge the summaries on metrics we define ahead of time - in this case: Correctness and Completeness.

We'll note different places you could change this example to adjust to your desired workflow along the way, as Custom LLM-as-a-Judge is a flexible evaluation!

Necessary Configurations

You'll need to have set up the NeMo Microservices including:

  • NeMo Evaluator
  • NeMo Data Store and Entity Store

If you wish to evaluate a NIM for LLMs, or use a NIM for LLMs as a judge, you will also need to provide the respective NIM for LLMs URL.

[8]
[5]
Data Store endpoint: https://datastore.int.aire.nvidia.com
Entity Store, Customizer, Evaluator endpoint: https://nmp.int.aire.nvidia.com
NIM endpoint: https://nim.int.aire.nvidia.com
Namespace: custom-llm-as-a-judge-eval

Setting Up NeMo Data Store and Entity Store

We'll first need to ensure that our namespace is created and is available both in our NeMo Entity Store and Data Store.

[9]
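As a sketch, the namespace-creation calls might look like the following. The exact endpoint paths are assumptions based on the NeMo Microservices REST APIs; adjust the URLs to match your deployment.

```python
import requests

# Endpoints and namespace from the configuration cell above; substitute your own.
ENTITY_STORE_URL = "https://nmp.int.aire.nvidia.com"
DATA_STORE_URL = "https://datastore.int.aire.nvidia.com"
NAMESPACE = "custom-llm-as-a-judge-eval"

def create_namespaces(entity_store_url: str, data_store_url: str, namespace: str):
    """Create the namespace in both NeMo Entity Store and NeMo Data Store."""
    # Entity Store: create the namespace with a JSON body.
    entity_resp = requests.post(
        f"{entity_store_url}/v1/namespaces", json={"id": namespace}
    )
    # Data Store: create the namespace on its own endpoint
    # (the path is an assumption -- check your deployment's API docs).
    datastore_resp = requests.post(
        f"{data_store_url}/v1/datastore/namespaces", data={"namespace": namespace}
    )
    return entity_resp, datastore_resp
```

The two `<Response>` objects printed below correspond to these two calls.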
<Response [200]>
<Response [201]>

Now we can do a simple verification.

[10]
Status Code: 201
Response JSON: {'namespace': 'custom-llm-as-a-judge-eval-v1', 'created_at': '2025-05-08T17:19:19Z', 'updated_at': '2025-05-08T17:19:19Z'}
Status Code: 200
Response JSON: {'id': 'custom-llm-as-a-judge-eval-v1', 'created_at': '2025-05-08T17:19:19.316626', 'updated_at': '2025-05-08T17:19:19.316630', 'description': None, 'project': None, 'custom_fields': {}, 'ownership': None}

Creating a Repository for our Data

Next, we'll want to create a repository on our NeMo Data Store!

We'll start by defining our repository ID.

[11]

Next, we can use the Hugging Face Hub API to create the repository.

[ ]
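A minimal sketch of that cell, assuming the repository ID follows the `namespace/dataset-name` convention (the `/v1/hf` endpoint suffix matches the `RepoUrl` output below):

```python
from huggingface_hub import HfApi

# Assumed values -- adjust to your deployment and namespace.
DATA_STORE_URL = "https://datastore.int.aire.nvidia.com"
NAMESPACE = "custom-llm-as-a-judge-eval-v1"
DATASET_NAME = "custom-llm-as-a-judge-eval-data-v1"
REPO_ID = f"{NAMESPACE}/{DATASET_NAME}"

# Point the Hugging Face client at the Data Store's HF-compatible endpoint.
hf_api = HfApi(endpoint=f"{DATA_STORE_URL}/v1/hf", token="")

def create_dataset_repo():
    """Create the dataset repository on the NeMo Data Store."""
    return hf_api.create_repo(repo_id=REPO_ID, repo_type="dataset")
```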
RepoUrl('datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1', endpoint='https://datastore.int.aire.nvidia.com/v1/hf', repo_type='dataset', repo_id='custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1')

We're going to upload our data to the NeMo Data Store, but before we do - let's take a look at it!

Here's an example of a row of data:

{
    "ID": "C012", 
    "content": "Date: 2025-04-12\nChief Complaint (CC): ...", 
    "summary": "New Clinical Problem: ..."
}

As you can see, we have a content field with a synthetically generated medical consultation, as well as a summary field with an AI-generated summary.

NOTE: In this example we won't be directly leveraging the summary field - but we'll cover how you would be able to leverage extra fields if they were necessary!

Next, let's upload our file directly to our newly created repository using the following code cell!

[13]
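The upload step might look like the following sketch; the filename matches the commit message in the output below, and the local path is an assumption.

```python
from huggingface_hub import HfApi

DATA_STORE_URL = "https://datastore.int.aire.nvidia.com"
REPO_ID = "custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1"
FILE_NAME = "doctor_consults_with_summaries.jsonl"

hf_api = HfApi(endpoint=f"{DATA_STORE_URL}/v1/hf", token="")

def upload_dataset_file():
    """Upload the JSONL data file into the dataset repository."""
    return hf_api.upload_file(
        path_or_fileobj=FILE_NAME,  # local path to the data file (assumed)
        path_in_repo=FILE_NAME,     # where it lands in the repository
        repo_id=REPO_ID,
        repo_type="dataset",
    )
```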
CommitInfo(commit_url='', commit_message='Upload doctor_consults_with_summaries.jsonl with huggingface_hub', commit_description='', oid='6f73f13bc78005f7edc7306437aa61a758e0c560', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

Next, we'll register the dataset with NeMo Entity Store!

This will allow us to leverage this dataset for evaluation jobs - through the /v1/datasets/ endpoint, which will allow us to refer to the dataset by its namespace and name.

[14]
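A sketch of the registration call, with payload fields mirroring the response echoed back below:

```python
import requests

ENTITY_STORE_URL = "https://nmp.int.aire.nvidia.com"  # assumed Entity Store base URL

# Payload fields mirror the registration response in this notebook's output.
dataset_payload = {
    "name": "custom-llm-as-a-judge-eval-data-v1",
    "namespace": "custom-llm-as-a-judge-eval-v1",
    "description": "LLM As a Judge Test",
    "files_url": "hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1",
    "project": "custom-llm-as-a-judge-test",
}

def register_dataset():
    """Register the uploaded dataset with NeMo Entity Store."""
    resp = requests.post(f"{ENTITY_STORE_URL}/v1/datasets", json=dataset_payload)
    resp.raise_for_status()
    return resp.json()
```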
{'created_at': '2025-05-08T17:19:35.030601',
 'updated_at': '2025-05-08T17:19:35.030605',
 'name': 'custom-llm-as-a-judge-eval-data-v1',
 'namespace': 'custom-llm-as-a-judge-eval-v1',
 'description': 'LLM As a Judge Test',
 'format': None,
 'files_url': 'hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1',
 'hf_endpoint': None,
 'split': None,
 'limit': None,
 'id': 'dataset-CU15CaykeHJJPTKrZDJw68',
 'project': 'custom-llm-as-a-judge-test',
 'custom_fields': {}}

Now, let's verify it landed.

[15]
Files URL: hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1

NeMo Evaluator Set-Up

In the following steps, we'll make a few assumptions:

  1. You will be using an OpenAI model as the Judge LLM
  2. You will be using a build.nvidia.com model to generate responses.

Each of these models can be changed to accommodate NIM for LLMs, or any OpenAI API compatible models.

Evaluation Configuration Set Up

In order to use both the OpenAI model, and the build.nvidia.com model, we'll need to provide our API keys for both!

NOTE: You can find the API key on build.nvidia.com by clicking the green "Get API Key" button!

[16]
[17]

Judge LLM Configuration

In the following cell - we're going to set our Judge LLM configuration. While the example provided is for an OpenAI model - you could change this to point at any Judge LLM you'd like that is compatible with NeMo Evaluator.

This includes, but is not limited to:

Completion Endpoints

"api_endpoint": {
    "url": "<my-nim-deployment-base-url>/chat/completions",
    "model_id": "<my-model>"
}

External Endpoint

"api_endpoint": {
    "url": "<external-openai-compatible-base-url>/chat/completions",
    "model_id": "<external-model>",
    "api_key": "<my-api-key>",
    "format": "openai"        
}

You can check out more examples on this page of the documentation.

[18]

Let's build a few prompt templates we can use for our Judge LLM to judge the produced summary on a few different metrics!

[19]
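Based on the prompts echoed back in the job configuration later in this notebook, the two judge system prompts might look like the following (the exact wording is illustrative, not prescriptive):

```python
# System prompt for the "completeness" metric.
COMPLETENESS_SYSTEM_PROMPT = """
You are a judge. Rate how complete the summary is
on a scale from 1 to 5:
1 = missing critical information ... 5 = fully complete
Please respond with RATING: <number>
"""

# System prompt for the "correctness" metric.
CORRECTNESS_SYSTEM_PROMPT = """
You are a judge. Rate the summary's correctness
(no false info) on a scale 1-5:
1 = many inaccuracies ... 5 = completely accurate
Please respond with RATING: <number>
"""
```

Note that both prompts end with the same `RATING: <number>` instruction, which the regex parser below relies on.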

Let's also set up our user prompt, which we'll use across both metrics.

Notice that we can reference items in our dataset through the {{ item.content }} template. If we wanted to address our summaries, we could instead use {{ item.summary }}!

Also notice that we can address the generation from our target LLM with the {{ sample.output_text }}.

[20]
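The shared user prompt might look like this sketch; the placeholder syntax matches the template echoed back in the job configuration later in this notebook.

```python
# User prompt shared by both judge metrics. The {{ item.content }} and
# {{ sample.output_text }} placeholders are filled in by NeMo Evaluator
# with the dataset row and the target model's generation, respectively.
JUDGE_USER_PROMPT = """
Full Consult: {{ item.content }}
Summary: {{ sample.output_text }}
"""
```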

We will also need to pass a regex parser so we can collect the numeric scores from our prompt - for this reason, it's important to specify in the system prompt some easily identifiable score extraction sequence.

In the example system prompt above, you'll notice we used:

"Please respond with RATING: <number>"

This allows us to use the following parser to collect our scores.

"scores": { 
    "completeness": { 
        "type": "int",
        "parser": {
            "type": "regex",
            "pattern": r"RATING:\s*(\d+)"
        }
    },
}
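To sanity-check the pattern locally before running a job, you can test it against a mock judge response:

```python
import re

# The same pattern used in the parser config above.
RATING_PATTERN = r"RATING:\s*(\d+)"

# A mock judge response, for illustration only.
sample_judge_output = "The summary captures all key details. RATING: 5"

match = re.search(RATING_PATTERN, sample_judge_output)
score = int(match.group(1))
print(score)  # -> 5
```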

Now that we've got the atomic parts of our Custom LLM-as-a-Judge evaluation configuration in place - let's build the whole thing!

NOTE: We're using two metrics here correctness and completeness - but you can define more (or a single metric) as you see fit!

[21]
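A sketch of the assembled configuration is below. The task template, judge endpoint, model ID, and dataset `files_url` mirror the values echoed back by the job later in this notebook; the API-key placeholders and the condensed judge prompts are illustrative, so substitute your own values.

```python
RATING_PATTERN = r"RATING:\s*(\d+)"

# Judge endpoint mirrors the job output below; replace the key placeholder.
JUDGE_ENDPOINT = {
    "url": "https://api.openai.com/v1/chat/completions",
    "model_id": "gpt-4.1",
    "api_key": "<OPENAI_API_KEY>",
}

def judge_metric(name: str, system_prompt: str) -> dict:
    """Build one LLM-judge metric with a regex score parser."""
    return {
        "type": "llm-judge",
        "params": {
            "model": {"api_endpoint": JUDGE_ENDPOINT},
            "template": {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {
                        "role": "user",
                        "content": "Full Consult: {{ item.content }}\nSummary: {{ sample.output_text }}",
                    },
                ]
            },
            "scores": {
                name: {"type": "int", "parser": {"type": "regex", "pattern": RATING_PATTERN}}
            },
        },
    }

eval_config = {
    "type": "custom",
    "name": "doctor_consult_summary_eval",
    "namespace": "default",
    "tasks": {
        "consult_summary_eval": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {
                            "role": "system",
                            "content": "Given a full medical consultation, please provide a 50 word summary of the consultation.",
                        },
                        {"role": "user", "content": "Full Consult: {{ item.content }}"},
                    ],
                    "max_tokens": 200,
                }
            },
            "metrics": {
                "completeness": judge_metric(
                    "completeness",
                    "You are a judge. Rate how complete the summary is on a scale from 1 to 5. Please respond with RATING: <number>",
                ),
                "correctness": judge_metric(
                    "correctness",
                    "You are a judge. Rate the summary's correctness (no false info) on a scale 1-5. Please respond with RATING: <number>",
                ),
            },
            "dataset": {
                "files_url": "hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1",
                "limit": 25,
            },
        }
    },
}
```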

Target Configuration

Just as with the Judge LLM - you can specify any target you'd like; please see the Target documentation for more examples!

We're going to be using Llama 3.1 70B as our model to be tested in this example.

[27]
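A sketch of the target configuration, mirroring the values echoed back by the job below (replace the API-key placeholder with your own key):

```python
# Target: the model whose summaries will be judged.
target_config = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "model_id": "meta/llama-3.1-70b-instruct",
            "api_key": "<NVIDIA_API_KEY>",  # placeholder -- substitute your key
            "format": "nim",
        }
    },
}
```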

Evaluation Job and Status

At this point - we're ready to kick off our Evaluation Job, as we've prepared both our Evaluation Configuration and our Target Configuration!

[28]
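Submitting the job might look like the following sketch; the Evaluator base URL and request-body shape are assumptions based on the NeMo Evaluator REST API.

```python
import requests

EVALUATOR_URL = "https://nmp.int.aire.nvidia.com"  # assumed Evaluator base URL

def launch_evaluation_job(eval_config: dict, target_config: dict) -> dict:
    """Submit the evaluation job and return the created job record."""
    resp = requests.post(
        f"{EVALUATOR_URL}/v1/evaluation/jobs",
        json={"config": eval_config, "target": target_config},
    )
    resp.raise_for_status()
    return resp.json()
```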
{'created_at': '2025-05-08T17:21:24.210032',
 'updated_at': '2025-05-08T17:21:24.210033',
 'id': 'eval-CjUYLQdriBtAA5X9KPDLAU',
 'namespace': 'default',
 'description': None,
 'target': {'schema_version': '1.0',
  'id': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn',
  'description': None,
  'type_prefix': 'eval-target',
  'namespace': 'default',
  'project': None,
  'created_at': '2025-05-08T17:21:24.209441',
  'updated_at': '2025-05-08T17:21:24.209442',
  'custom_fields': {},
  'ownership': None,
  'name': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn',
  'type': 'model',
  'cached_outputs': None,
  'model': {'schema_version': '1.0',
   'id': 'model-R1gn9w1tPwdfukCDoCBF2F',
   'description': None,
   'type_prefix': 'model',
   'namespace': 'default',
   'project': None,
   'created_at': '2025-05-08T17:21:24.209465',
   'updated_at': '2025-05-08T17:21:24.209465',
   'custom_fields': {},
   'ownership': None,
   'name': 'model-R1gn9w1tPwdfukCDoCBF2F',
   'version_id': 'main',
   'version_tags': [],
   'spec': None,
   'artifact': None,
   'base_model': None,
   'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',
    'model_id': 'meta/llama-3.1-70b-instruct',
    'api_key': '******',
    'format': 'nim'},
   'peft': None,
   'prompt': None,
   'guardrails': None},
  'retriever': None,
  'rag': None,
  'rows': None,
  'dataset': None},
 'config': {'schema_version': '1.0',
  'id': 'eval-config-MGiCNpVtA1vKP3e7Npqm3P',
  'description': None,
  'type_prefix': 'eval-config',
  'namespace': 'default',
  'project': None,
  'created_at': '2025-05-08T17:21:24.209320',
  'updated_at': '2025-05-08T17:21:24.209322',
  'custom_fields': {},
  'ownership': None,
  'name': 'doctor_consult_summary_eval',
  'type': 'custom',
  'params': None,
  'tasks': {'consult_summary_eval': {'type': 'chat-completion',
    'params': {'template': {'messages': [{'role': 'system',
        'content': 'Given a full medical consultation, please provide a 50 word summary of the consultation.'},
       {'role': 'user', 'content': 'Full Consult: {{ item.content }}'}],
      'max_tokens': 200}},
    'metrics': {'completeness': {'type': 'llm-judge',
      'params': {'model': {'api_endpoint': {'url': 'https://api.openai.com/v1/chat/completions',
         'model_id': 'gpt-4.1',
         'api_key': '******'}},
       'template': {'messages': [{'role': 'system',
          'content': '\nYou are a judge. Rate how complete the summary is \non a scale from 1 to 5:\n1 = missing critical information … 5 = fully complete\nPlease respond with RATING: <number>\n'},
         {'role': 'user',
          'content': '\nFull Consult: {{ item.content }}\nSummary: {{ sample.output_text }}\n'}]},
       'scores': {'completeness': {'type': 'int',
         'parser': {'type': 'regex', 'pattern': 'RATING:\\s*(\\d+)'}}}}},
     'correctness': {'type': 'llm-judge',
      'params': {'model': {'api_endpoint': {'url': 'https://api.openai.com/v1/chat/completions',
         'model_id': 'gpt-4.1',
         'api_key': '******'}},
       'template': {'messages': [{'role': 'system',
          'content': "\nYou are a judge. Rate the summary's correctness \n(no false info) on a scale 1-5:\n1 = many inaccuracies … 5 = completely accurate\nPlease respond with RATING: <number>\n"},
         {'role': 'user',
          'content': '\nFull Consult: {{ item.content }}\nSummary: {{ sample.output_text }}\n'}]},
       'scores': {'correctness': {'type': 'int',
         'parser': {'type': 'regex', 'pattern': 'RATING:\\s*(\\d+)'}}}}}},
    'dataset': {'schema_version': '1.0',
     'id': 'dataset-9dhgzV4RtNi1vceLtG9b37',
     'description': None,
     'type_prefix': None,
     'namespace': 'default',
     'project': None,
     'created_at': '2025-05-08T17:21:24.209381',
     'updated_at': '2025-05-08T17:21:24.209381',
     'custom_fields': {},
     'ownership': None,
     'name': 'dataset-9dhgzV4RtNi1vceLtG9b37',
     'version_id': 'main',
     'version_tags': [],
     'format': None,
     'files_url': 'hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1/',
     'hf_endpoint': None,
     'split': None,
     'limit': 25}}},
  'groups': None},
 'result': None,
 'output_files_url': None,
 'status_details': {'message': None, 'task_status': {}, 'progress': None},
 'status': 'created',
 'project': None,
 'custom_fields': {},
 'ownership': None}

We'll use the following helper function to wait for our job to be completed.

[29]
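A sketch of such a helper is below. To keep it easy to test, it takes a `get_status` callable (e.g. a thin wrapper around `GET /v1/evaluation/jobs/{id}` that returns the parsed JSON) rather than hard-coding the HTTP call; the terminal status names are assumptions.

```python
import time

def wait_for_job(get_status, poll_interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll `get_status` until the job reaches a terminal state.

    `get_status` must return a dict with a "status" key and, optionally,
    a "status_details" dict containing "progress".
    """
    start = time.monotonic()
    while True:
        job = get_status()
        status = job.get("status")
        elapsed = time.monotonic() - start
        progress = (job.get("status_details") or {}).get("progress")
        print(f"Job status: {status} after {elapsed:.2f} seconds. Progress: {progress}%")
        if status in ("completed", "failed", "cancelled"):
            return job
        if elapsed > timeout:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        time.sleep(poll_interval)
```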

The job itself may take ~250-300s to complete, depending on hardware, models used, and other factors.

[30]
Job status: running after 6.03 seconds. Progress: 8.0%
Job status: running after 14.26 seconds. Progress: 16.0%
Job status: running after 21.37 seconds. Progress: 20.0%
Job status: running after 26.88 seconds. Progress: 28.0%
Job status: running after 32.41 seconds. Progress: 32.0%
Job status: running after 37.93 seconds. Progress: 40.0%
Job status: running after 43.44 seconds. Progress: 44.0%
Job status: running after 48.95 seconds. Progress: 48.0%
Job status: running after 56.06 seconds. Progress: 56.0%
Job status: running after 61.57 seconds. Progress: 64.0%
Job status: running after 67.08 seconds. Progress: 72.0%
Job status: running after 72.59 seconds. Progress: 72.0%
Job status: running after 78.12 seconds. Progress: 80.0%
Job status: running after 83.64 seconds. Progress: 84.0%
Job status: running after 90.75 seconds. Progress: 88.0%
Job status: running after 96.26 seconds. Progress: 96.0%
Job status: completed after 101.78 seconds. Progress: 100%

Now we can verify our job is complete!

[31]
{'created_at': '2025-05-08T17:21:24.210032', 'updated_at': '2025-05-08T17:23:09.228092', 'id': 'eval-CjUYLQdriBtAA5X9KPDLAU', 'namespace': 'default', 'description': None, 'target': {'schema_version': '1.0', 'id': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn', 'description': None, 'type_prefix': 'eval-target', 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209441', 'updated_at': '2025-05-08T17:21:24.209442', 'custom_fields': {}, 'ownership': None, 'name': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn', 'type': 'model', 'cached_outputs': None, 'model': {'schema_version': '1.0', 'id': 'model-R1gn9w1tPwdfukCDoCBF2F', 'description': None, 'type_prefix': 'model', 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209465', 'updated_at': '2025-05-08T17:21:24.209465', 'custom_fields': {}, 'ownership': None, 'name': 'model-R1gn9w1tPwdfukCDoCBF2F', 'version_id': 'main', 'version_tags': [], 'spec': None, 'artifact': None, 'base_model': None, 'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions', 'model_id': 'meta/llama-3.1-70b-instruct', 'api_key': '******', 'format': 'nim'}, 'peft': None, 'prompt': None, 'guardrails': None}, 'retriever': None, 'rag': None, 'rows': None, 'dataset': None}, 'config': {'schema_version': '1.0', 'id': 'eval-config-MGiCNpVtA1vKP3e7Npqm3P', 'description': None, 'type_prefix': 'eval-config', 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209320', 'updated_at': '2025-05-08T17:21:24.209322', 'custom_fields': {}, 'ownership': None, 'name': 'doctor_consult_summary_eval', 'type': 'custom', 'params': None, 'tasks': {'consult_summary_eval': {'type': 'chat-completion', 'params': {'template': {'messages': [{'role': 'system', 'content': 'Given a full medical consultation, please provide a 50 word summary of the consultation.'}, {'role': 'user', 'content': 'Full Consult: {{ item.content }}'}], 'max_tokens': 200}}, 'metrics': {'completeness': {'type': 'llm-judge', 'params': 
{'model': {'api_endpoint': {'url': 'https://api.openai.com/v1/chat/completions', 'model_id': 'gpt-4.1', 'api_key': '******'}}, 'template': {'messages': [{'role': 'system', 'content': '\nYou are a judge. Rate how complete the summary is \non a scale from 1 to 5:\n1 = missing critical information … 5 = fully complete\nPlease respond with RATING: <number>\n'}, {'role': 'user', 'content': '\nFull Consult: {{ item.content }}\nSummary: {{ sample.output_text }}\n'}]}, 'scores': {'completeness': {'type': 'int', 'parser': {'type': 'regex', 'pattern': 'RATING:\\s*(\\d+)'}}}}}, 'correctness': {'type': 'llm-judge', 'params': {'model': {'api_endpoint': {'url': 'https://api.openai.com/v1/chat/completions', 'model_id': 'gpt-4.1', 'api_key': '******'}}, 'template': {'messages': [{'role': 'system', 'content': "\nYou are a judge. Rate the summary's correctness \n(no false info) on a scale 1-5:\n1 = many inaccuracies … 5 = completely accurate\nPlease respond with RATING: <number>\n"}, {'role': 'user', 'content': '\nFull Consult: {{ item.content }}\nSummary: {{ sample.output_text }}\n'}]}, 'scores': {'correctness': {'type': 'int', 'parser': {'type': 'regex', 'pattern': 'RATING:\\s*(\\d+)'}}}}}}, 'dataset': {'schema_version': '1.0', 'id': 'dataset-9dhgzV4RtNi1vceLtG9b37', 'description': None, 'type_prefix': None, 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209381', 'updated_at': '2025-05-08T17:21:24.209381', 'custom_fields': {}, 'ownership': None, 'name': 'dataset-9dhgzV4RtNi1vceLtG9b37', 'version_id': 'main', 'version_tags': [], 'format': None, 'files_url': 'hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1/', 'hf_endpoint': None, 'split': None, 'limit': 25}}}, 'groups': None}, 'result': 'evaluation_result-QWzZVGtKY1Z7hdty63hLfP', 'output_files_url': 'hf://datasets/evaluation-results/eval-CjUYLQdriBtAA5X9KPDLAU', 'status_details': {'message': 'Job completed successfully.', 'task_status': {'consult_summary_eval': 
'completed'}, 'progress': 100.0}, 'status': 'completed', 'project': None, 'custom_fields': {}, 'ownership': None}

Now that it's complete - we can look at the scores the Custom LLM-as-a-Judge evaluation produced!

[32]
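Fetching the result record might look like this sketch; the results path is an assumption based on the NeMo Evaluator REST API.

```python
import requests

EVALUATOR_URL = "https://nmp.int.aire.nvidia.com"  # assumed Evaluator base URL

def get_job_results(job_id: str) -> dict:
    """Fetch the aggregated scores for a completed evaluation job."""
    resp = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results")
    resp.raise_for_status()
    return resp.json()
```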
{'created_at': '2025-05-08T17:21:24.281058',
 'updated_at': '2025-05-08T17:21:24.281059',
 'id': 'evaluation_result-QWzZVGtKY1Z7hdty63hLfP',
 'job': 'eval-CjUYLQdriBtAA5X9KPDLAU',
 'tasks': {'consult_summary_eval': {'metrics': {'completeness': {'scores': {'completeness': {'value': 4.92,
       'stats': {'count': 25, 'sum': 123.0, 'mean': 4.92}}}},
    'correctness': {'scores': {'correctness': {'value': 4.92,
       'stats': {'count': 25, 'sum': 123.0, 'mean': 4.92}}}}}}},
 'groups': {},
 'namespace': 'default',
 'custom_fields': {}}