NVIDIA Clinical Triage Pipeline

Emergency Triage LLM Evaluation 🏥

Problem: Accurately labeling Emergency Severity Index (ESI) levels from nurse triage notes is crucial for clinical research and model development. However, real-world data is highly sensitive and access is often limited. Obtaining high-quality human annotations is also costly and slow, which makes it challenging to create large, diverse datasets for robust evaluation.

Solution: Synthetic data provides a scalable and privacy-preserving approach. By simulating realistic triage notes and ESI labels, we can build rich datasets without exposing patient information or relying heavily on human annotators. This strategy accelerates iteration, benchmarking, and model improvement, addressing key bottlenecks caused by data scarcity.

Use case: Predict ESI levels from synthetic nurse triage notes using LLMs
Goal: Evaluate model accuracy and the quality/complexity of generated notes across a range of clinical scenarios
Pipeline: Synthetic data ➔ LLM-as-a-Judge scoring ➔ Filtering ➔ Evaluation

 ┌───────────────────────────────┐        ┌─────────────────────────────┐
 │      NeMo Data Designer       │        │        NeMo Evaluator       │
 │  +------------------------+   │        │  +-----------------------+  │
 │  | Nurse Triage Note 📝   |───┼───────▶|  | LLM predicts ESI 🔍🤖 |  │
 │  +------------------------+   │        │  +-----------------------+  │
 │            +                  │        │              |              │
 │                               │        │              v              │
 │  +------------------------+   │        │  +-----------------------+  │
 │  | Ground Truth (ESI) ✅  |───┼───────▶|  |    Predicted ESI 🏷️   |  │
 │  +------------------------+   │        │  +-----------------------+  │
 └───────────────────────────────┘        │              |              │
                                          │              v              │
                                          │  +-----------------------+  │
                                          │  |      Metrics 📊       |  │
                                          │  |     (Accuracy)        |  │
                                          │  +-----------------------+  │
                                          └─────────────────────────────┘

Workflow Overview:

🏗️ Generate realistic, privacy-safe triage notes, evaluate their quality using LLM-as-a-Judge, and filter for high-value examples with Data Designer.
⬆️ Upload the resulting dataset to a compatible datastore (e.g., HuggingFace Datasets).
📈 Use the Evaluator to compute ESI classification accuracy and other relevant metrics.

Tip: Run the cells below in order. You can re-run data preview/generation to explore different clinical scenarios and difficulty settings.

Step 1: 🎨 NeMo Data Designer

🎲 Sampler columns

🦜 LLM-generated columns

⚖️ LLM-as-a-Judge Evaluation Step

🧪 Generate & Preview

Tip: Re-run preview to cycle examples; adjust prompts, temperatures, or scenarios to tune realism and difficulty.

🚀 Scale Up Generations

Once satisfied with the preview results, scale up to generate the full dataset.

🧹 Refinement [Optional]

Filter the generated dataset to retain only higher-quality triage notes:

Keeps only notes with a Clinical Coherence score ≥ 2, as judged by the LLM.
Reads the complexity level of each note directly from the LLM judge column (triage_note_quality).
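
The two filtering rules above can be sketched in plain Python. This is a minimal illustration, not the notebook's own cell: the nested keys `clinical_coherence` and `complexity`, the `triage_note` field, and the sample rows are all assumptions made for the example. Only the `triage_note_quality` column name and the threshold of 2 come from the text.

```python
# Toy records for illustration only. The nested judge fields
# ("clinical_coherence", "complexity") are assumed names, not a documented schema.
records = [
    {"triage_note": "Chest pain radiating to left arm, diaphoretic.",
     "esi_level_description": "ESI 2: Emergent",
     "triage_note_quality": {"clinical_coherence": 3, "complexity": "Complex"}},
    {"triage_note": "Twisted ankle, walking, mild swelling.",
     "esi_level_description": "ESI 4: Less Urgent",
     "triage_note_quality": {"clinical_coherence": 2, "complexity": "Simple"}},
    {"triage_note": "Pt states pain. Vitals fine. Needs surgery now.",
     "esi_level_description": "ESI 5: Non-Urgent",
     "triage_note_quality": {"clinical_coherence": 1, "complexity": "Moderate"}},
]

def refine(rows, min_coherence=2):
    """Keep rows whose judged coherence meets the threshold and
    copy the judged complexity into a top-level column."""
    kept = []
    for row in rows:
        judge = row["triage_note_quality"]
        if judge["clinical_coherence"] >= min_coherence:
            # Build a new dict so the input rows are left unmodified.
            kept.append({**row, "complexity": judge["complexity"]})
    return kept

filtered = refine(records)
print(len(filtered), [row["complexity"] for row in filtered])
```

Promoting the complexity to a top-level column makes it straightforward to split the dataset by complexity level later, during evaluation.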

👀 Inspect results

Step 2: 📊 NeMo Evaluator

We evaluate each model on the filtered triage notes to check whether it predicts the correct ESI level.

Dataset: HF-compatible JSONL served by the datastore
Task: Completion with structured output { "esi_level_description": "..." }
Metric: String containment check against ground-truth ESI
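
As a rough illustration of the metric, the check below parses a model's structured output and tests whether the ground-truth ESI label appears inside the `esi_level_description` field. It is a sketch under stated assumptions and not the Evaluator's implementation: the function name `esi_match`, the case-insensitive comparison, and the example label strings are all hypothetical.

```python
import json

def esi_match(model_output: str, ground_truth: str) -> bool:
    """Return True if the ground-truth label is contained in the
    predicted esi_level_description (case-insensitive, an assumption)."""
    target = ground_truth.strip().lower()
    if not target:
        # An empty label would trivially match any string, so reject it.
        return False
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        # Output that is not valid JSON counts as a miss.
        return False
    if not isinstance(parsed, dict):
        return False
    predicted = str(parsed.get("esi_level_description", ""))
    return target in predicted.lower()

# Illustrative strings only; the real label format may differ.
print(esi_match('{"esi_level_description": "ESI 2: Emergent"}', "ESI 2"))
print(esi_match('{"esi_level_description": "ESI 3: Urgent"}', "ESI 2"))
```

A containment check is lenient by design: a prediction such as "ESI 2: Emergent" matches the label "ESI 2", but so would any longer string that happens to include the label, which is worth keeping in mind when reading the accuracy numbers.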

🧪 Evaluator Flow

This section defines the evaluation configuration used to assess model performance on triage note classification using a custom evaluator.

🔍 Model evaluation loop and configuration

This section compares multiple models (A/B testing) on the triage note classification task across each complexity level (Simple, Moderate, Complex).

The models evaluated are:

Qwen3-8B (Qwen/Qwen3-8B)
Nemotron Nano 9B v2 (nvidia/nvidia-nemotron-nano-9b-v2)

For each complexity level, the accuracy of each model is printed so that the models can be compared side by side.
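
The per-level comparison can be sketched as a simple grouping step. The snippet assumes each evaluated example has already been reduced to a `(model, complexity, is_correct)` tuple. That layout, the function name `accuracy_by_group`, and the placeholder model names are hypothetical, and the outcomes shown are toy values, not measured results.

```python
from collections import defaultdict

def accuracy_by_group(results):
    """Compute accuracy per (model, complexity) pair from
    an iterable of (model, complexity, is_correct) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, complexity, is_correct in results:
        key = (model, complexity)
        totals[key] += 1
        hits[key] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}

# Placeholder outcomes for two hypothetical models.
results = [
    ("model-a", "Simple", True),
    ("model-a", "Simple", False),
    ("model-a", "Complex", True),
    ("model-b", "Simple", True),
    ("model-b", "Complex", False),
]

scores = accuracy_by_group(results)
for (model, complexity), acc in sorted(scores.items()):
    print(f"{complexity:<9} {model:<10} {acc:.0%}")
```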

📊 Visualize Model Accuracies

The table below summarizes the accuracy (%) of each model for each complexity level.
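
A summary of this shape, with one row per complexity level and one column per model, could be assembled as in the sketch below. The function name `accuracy_table`, the `(model, complexity)` keyed dictionary, and the placeholder model names and scores are assumptions for illustration; they are not results from the notebook.

```python
def accuracy_table(scores, levels=("Simple", "Moderate", "Complex")):
    """Render {(model, complexity): accuracy} as a plain-text table,
    with accuracies shown as percentages."""
    models = sorted({model for model, _ in scores})
    header = f"{'Complexity':<12}" + "".join(f"{m:>12}" for m in models)
    lines = [header]
    for level in levels:
        cells = []
        for m in models:
            value = scores.get((m, level))
            # Missing combinations are shown as n/a instead of 0%.
            cells.append(f"{'n/a':>12}" if value is None else f"{value * 100:>11.1f}%")
        lines.append(f"{level:<12}" + "".join(cells))
    return "\n".join(lines)

# Placeholder values only.
demo_scores = {
    ("model-a", "Simple"): 0.5,
    ("model-b", "Simple"): 1.0,
    ("model-a", "Complex"): 0.75,
}
table = accuracy_table(demo_scores)
print(table)
```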
