Clinical Triage Pipeline
Emergency Triage LLM Evaluation
Problem: Accurately labeling Emergency Severity Index (ESI) levels from nurse triage notes is crucial for clinical research and model development. However, real-world data is highly sensitive and access is often limited. Obtaining high-quality human annotations is also costly and slow, which makes it challenging to create large, diverse datasets for robust evaluation.
Solution: Synthetic data provides a scalable and privacy-preserving approach. By simulating realistic triage notes and ESI labels, we can build rich datasets without exposing patient information or relying heavily on human annotators. This strategy accelerates iteration, benchmarking, and model improvement, addressing key bottlenecks caused by data scarcity.
- Use case: Predict ESI levels from synthetic nurse triage notes using LLMs
- Goal: Evaluate model accuracy and the quality/complexity of generated notes across a range of clinical scenarios
- Pipeline: Synthetic data → LLM-as-a-Judge scoring → Filtering → Evaluation
+----------------------------+        +---------------------------+
|     NeMo Data Designer     |        |      NeMo Evaluator       |
| +------------------------+ |        | +-----------------------+ |
| |   Nurse Triage Note    |-+------->| |   LLM predicts ESI    | |
| +------------------------+ |        | +-----------------------+ |
|             +              |        |             |             |
|                            |        |             v             |
| +------------------------+ |        | +-----------------------+ |
| |   Ground Truth (ESI)   |-+------->| |     Predicted ESI     | |
| +------------------------+ |        | +-----------------------+ |
+----------------------------+        |             |             |
                                      |             v             |
                                      | +-----------------------+ |
                                      | |        Metrics        | |
                                      | |      (Accuracy)       | |
                                      | +-----------------------+ |
                                      +---------------------------+
Workflow Overview:
- Generate realistic, privacy-safe triage notes, evaluate their quality using LLM-as-a-Judge, and filter for high-value examples with Data Designer.
- Upload the resulting dataset to a compatible datastore (e.g., HuggingFace Datasets).
- Use the Evaluator to compute ESI classification accuracy and other relevant metrics.
Tip: Run the cells below in order. You can re-run data preview/generation to explore different clinical scenarios and difficulty settings.
Step 1: NeMo Data Designer
Sampler columns
LLM-generated columns
LLM-as-a-Judge Evaluation Step
Generate & Preview
Tip: Re-run preview to cycle examples; adjust prompts, temperatures, or scenarios to tune realism and difficulty.
Scale Up Generations
Once satisfied with the preview results, scale up to generate the full dataset.
Refinement [Optional]
Filter the generated dataset to retain only higher-quality triage notes:
- Keeps only notes with Clinical Coherence ≥ 2 (as judged by the LLM).
- Retrieves ESI level complexity directly from the LLM judge column (triage_note_quality).
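The filtering rule above can be sketched in plain Python. This is a minimal illustration, not the Data Designer API: it assumes each record is a dict whose `triage_note_quality` entry holds an integer `clinical_coherence` score and a string `esi_level_complexity`. Those two key names, the `note` field, and the `filter_notes` helper are hypothetical; adjust them to match the judge output you actually get.

```python
from typing import Any

MIN_COHERENCE = 2  # threshold stated in the refinement step


def filter_notes(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep records whose judged clinical coherence is >= MIN_COHERENCE."""
    kept = []
    for record in records:
        # Assumed layout: {"clinical_coherence": int, "esi_level_complexity": str}
        judge = record.get("triage_note_quality") or {}
        coherence = judge.get("clinical_coherence")
        if not isinstance(coherence, int) or coherence < MIN_COHERENCE:
            continue  # drop notes with a missing or low coherence score
        # Copy the complexity label out of the judge column for later grouping.
        kept.append({**record, "complexity": judge.get("esi_level_complexity")})
    return kept


# Placeholder records for illustration only.
sample = [
    {"note": "placeholder note A",
     "triage_note_quality": {"clinical_coherence": 3, "esi_level_complexity": "Moderate"}},
    {"note": "placeholder note B",
     "triage_note_quality": {"clinical_coherence": 1, "esi_level_complexity": "Simple"}},
]
filtered = filter_notes(sample)
print(len(filtered), filtered[0]["complexity"])
```

Records without a usable coherence score are dropped rather than kept, which errs on the side of a smaller but cleaner evaluation set.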
Inspect results
Step 2: NeMo Evaluator
We evaluate the model on filtered triage notes to see if it predicts the correct ESI level.
- Dataset: HF-compatible JSONL served by the datastore
- Task: Completion with structured output { "esi_level_description": "..." }
- Metric: String containment check against ground-truth ESI
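A string containment check of this kind might look like the sketch below. It is a standalone illustration, not the Evaluator's own implementation: the `esi_match` helper, the case-insensitive comparison, the fallback to raw text when the output is not valid JSON, and the label format `"ESI 2: Emergent"` are all assumptions.

```python
import json


def esi_match(model_output: str, ground_truth: str) -> bool:
    """Return True if the ground-truth ESI label appears in the prediction."""
    target = ground_truth.strip().lower()
    if not target:
        return False  # an empty label would otherwise match everything
    try:
        parsed = json.loads(model_output)
        if isinstance(parsed, dict):
            predicted = str(parsed.get("esi_level_description", ""))
        else:
            predicted = model_output
    except json.JSONDecodeError:
        # Assumption: fall back to the raw text if the model ignored the schema.
        predicted = model_output
    return target in predicted.lower()


print(esi_match('{"esi_level_description": "ESI 2: Emergent"}', "ESI 2: Emergent"))
print(esi_match('{"esi_level_description": "ESI 3: Urgent"}', "ESI 2: Emergent"))
```

Containment is more forgiving than exact equality, so a prediction that wraps the label in extra words still counts as correct. The trade-off is that a response listing several ESI levels could match by accident.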
Evaluator Flow
This section defines the evaluation configuration used to assess model performance on triage note classification using a custom evaluator.
Model evaluation loop and configuration
This section compares multiple models (A/B testing) on the triage note classification task across each complexity level (Simple, Moderate, Complex).
The models evaluated are:
- Qwen3-8B (Qwen/Qwen3-8B)
- Nemotron Nano 9B v2 (nvidia/nvidia-nemotron-nano-9b-v2)
For each complexity level, the accuracy score for each model is printed, allowing for side-by-side evaluation of how each model performs at every complexity.
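The shape of such a comparison loop can be sketched without any service calls. The example below is an assumption-laden illustration rather than the Evaluator workflow itself: `accuracy_by_level`, `stub_predict`, and the record fields `note`, `esi`, and `complexity` are hypothetical, and the stub stands in for a real model call. The demo records are placeholders, so the printed numbers carry no meaning.

```python
from collections import defaultdict
from typing import Callable

MODELS = ["Qwen/Qwen3-8B", "nvidia/nvidia-nemotron-nano-9b-v2"]
LEVELS = ["Simple", "Moderate", "Complex"]


def accuracy_by_level(
    records: list[dict[str, str]],
    predict: Callable[[str, str], str],
) -> dict[str, dict[str, float]]:
    """Return {model: {complexity: accuracy}} using a containment check."""
    results: dict[str, dict[str, float]] = {}
    for model in MODELS:
        hits: dict[str, int] = defaultdict(int)
        totals: dict[str, int] = defaultdict(int)
        for rec in records:
            level = rec["complexity"]
            totals[level] += 1
            output = predict(model, rec["note"])
            if rec["esi"].lower() in output.lower():
                hits[level] += 1
        # Only report levels that actually have records.
        results[model] = {
            lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]
        }
    return results


def stub_predict(model: str, note: str) -> str:
    # Stand-in for a real model call; always answers the same label.
    return '{"esi_level_description": "ESI 3: Urgent"}'


demo_records = [
    {"note": "placeholder note A", "esi": "ESI 3: Urgent", "complexity": "Simple"},
    {"note": "placeholder note B", "esi": "ESI 2: Emergent", "complexity": "Simple"},
]
demo_results = accuracy_by_level(demo_records, stub_predict)
for level in LEVELS:
    for model in MODELS:
        if level in demo_results[model]:
            print(f"{level:<10} {model:<40} {demo_results[model][level]:.2%}")
```

Iterating over levels in the outer loop keeps the two models adjacent in the printed output, which makes the side-by-side comparison easier to read.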
Visualize Model Accuracies
The table below summarizes the accuracy (%) of each model for each complexity level.
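One way to render a summary table of this kind as plain text is sketched below. The `format_accuracy_table` helper is hypothetical, and the input uses a made-up model name with placeholder numbers, not measured results. Accuracies are assumed to be fractions between 0 and 1.

```python
def format_accuracy_table(
    results: dict[str, dict[str, float]],
    levels: list[str],
) -> str:
    """Render {model: {level: accuracy}} as an aligned text table in percent."""
    header = ["Model"] + levels
    rows = [
        [model] + [
            f"{scores[lvl] * 100:.1f}" if lvl in scores else "n/a"
            for lvl in levels
        ]
        for model, scores in results.items()
    ]
    table = [header] + rows
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in table
    )


# Placeholder input for illustration only; these are not evaluation results.
placeholder = {"model-a": {"Simple": 1.0, "Moderate": 0.5}}
table = format_accuracy_table(placeholder, ["Simple", "Moderate", "Complex"])
print(table)
```

Levels with no score are shown as "n/a" instead of 0.0, so a missing evaluation is not mistaken for a model that got everything wrong.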