🧠 NeMo Data Designer: Synthetic Reasoning Traces

📚 What you'll learn

  • This notebook demonstrates how to use NeMo Data Designer to build a synthetic data generation pipeline tailored for reasoning tasks.

  • Instead of creating multi-turn conversations, we will generate reasoning traces that can be used to train and fine-tune
    language models with reinforcement learning techniques and to elicit chain-of-thought reasoning.

  • These synthetic reasoning traces can be used to enhance model performance in areas such as mathematics, coding, scientific
    reasoning, and other domains that benefit from structured reasoning.


👋 IMPORTANT – Environment Setup

  • If you haven't already, follow the instructions in the README to install the necessary dependencies.

  • You may need to restart your notebook's kernel after setting up the environment.

  • In this notebook, we assume you have a self-hosted instance of Data Designer up and running.

  • For deployment instructions, see the Installation Options section of the NeMo Data Designer documentation.

📦 Import the essentials

  • The data_designer module of nemo_microservices exposes Data Designer's high-level SDK.

  • The essentials module provides quick access to the most commonly used objects.


⚙️ Initialize the NeMo Data Designer Client

  • NeMoDataDesignerClient is responsible for submitting generation requests to the microservice.

🎛️ Define model configurations

  • Each ModelConfig defines a model that can be used during the generation process.

  • The "model alias" is used to reference the model in the Data Designer config (as we will see below).

  • The "model provider" is the external service that hosts the model (see the model config docs for more details).

  • By default, the microservice uses build.nvidia.com as the model provider.


🏗️ Initialize the Data Designer Config Builder

  • The Data Designer config defines the dataset schema and generation process.

  • The config builder provides an intuitive interface for building this configuration.

  • The list of model configs is provided to the builder at initialization.


🎲 Adding Categorical Columns for Controlled Diversity

  • Now we'll add categorical columns to control the diversity of our generated examples.

  • Sampler columns offer non-LLM-based generation of synthetic data.

  • They are particularly useful for steering the diversity of the generated data, as we demonstrate below.
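
To make the idea of non-LLM sampling concrete, the sketch below draws a category and then a subcategory that depends on it, using only Python's standard library. It does not use the Data Designer SDK; the `TOPICS` mapping, the `sample_row` helper, and the column names `domain` and `theme` are hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical category -> subcategory mapping (illustrative values only).
TOPICS = {
    "Family Dynamics": ["Parenting Dilemmas", "Sibling Rivalries"],
    "Workplace Challenges": ["Communication Breakdowns", "Leadership Dilemmas"],
}


def sample_row(rng: random.Random) -> dict:
    """Draw one row: a category first, then a subcategory conditioned on it."""
    domain = rng.choice(sorted(TOPICS))
    theme = rng.choice(TOPICS[domain])
    return {"domain": domain, "theme": theme}


rng = random.Random(0)  # seeded so the draws are reproducible
rows = [sample_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the subcategory is drawn from the list belonging to the sampled category, every row stays internally consistent, which is what makes sampled columns useful as seeds for later LLM prompts.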


🦜 LLM-generated columns

  • When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

  • As we see below, nested JSON fields can be accessed using dot notation.

  • These prompts instruct the LLM to produce the actual empathic reasoning trace and answer, following the specified format with <think> and <answer> tags.
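
The dot-notation lookup can be illustrated with a small stand-in written in plain Python. This is not Jinja itself and handles only `{{ dotted.path }}` placeholders; real Jinja supports far more (filters, loops, conditionals). The `resolve` and `render` helpers and the sample record are hypothetical.

```python
import re

# Matches {{ name }} or {{ name.nested.field }} with optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(path: str, record: dict):
    """Follow a dotted path such as 'scenario.summary' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, record: dict) -> str:
    """Replace each {{ dotted.path }} placeholder with the referenced value."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), record)), template)


# Hypothetical record: one flat column and one nested JSON column.
record = {"domain": "Family Dynamics", "scenario": {"summary": "A missed recital"}}
prompt = render("Domain: {{ domain }}. Scenario: {{ scenario.summary }}", record)
print(prompt)
```

Each placeholder is resolved against the current record, so a prompt for one column can draw on both sampled values and fields nested inside an earlier structured column.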

🧠 Empathic Reasoning Trace Generation

This column is designed to generate clear, thoughtful reasoning traces that blend logical analysis with emotional insight for everyday situations
where empathy is crucial. The generation prompt is tailored to:

  • Produce a structured explanation that highlights both the practical reasoning and the emotional dynamics at play.

  • Encourage a dual output: one part detailing the empathic thought process (enclosed within <think> tags) and another delivering a
    compassionate final answer (enclosed within <answer> tags).

  • Ensure that the generated content reflects deep understanding, compassion, and a balanced view of the challenges and emotions involved.
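
Downstream code usually needs the two parts of such an output separately. A minimal sketch of that split, assuming each tag pair appears once in the generated text (the `split_trace` helper and the sample string are hypothetical):

```python
import re
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_trace(text: str) -> Optional[dict]:
    """Return the reasoning and the answer, or None if either tag pair is missing."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    if think is None or answer is None:
        return None
    return {"think": think.group(1).strip(), "answer": answer.group(1).strip()}


sample = "<think>\nShe feels unheard.\n</think>\n<answer>Acknowledge her first.</answer>"
parsed = split_trace(sample)
print(parsed)
```

Returning `None` for malformed outputs makes it easy to filter out records that did not follow the requested format.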


⚡️ Empathic Reasoning Process Generation

  • These columns generate and evaluate a detailed empathic reasoning trace for addressing everyday scenarios.

  • The process emphasizes a compassionate, thoughtful approach that blends logical reasoning with emotional insight.

  • The prompts instruct the model to include its internal thought process within <think>...</think> tags before providing the JSON output.
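
When a response combines a thought block with a JSON payload, the two can be separated before the JSON is parsed. The sketch below assumes the thought block is wrapped in `<think>` tags and comes first; the `parse_structured` helper and the JSON field names are hypothetical.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_structured(raw: str) -> tuple[str, dict]:
    """Return (thought, payload): the <think> text and the JSON that follows it."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        raise ValueError("no <think>...</think> block found")
    thought = match.group(1).strip()
    # Everything after the closing tag is expected to be a JSON object.
    payload = json.loads(raw[match.end():].strip())
    return thought, payload


raw = '<think>Weigh both sides.</think>\n{"summary": "Listen first", "score": 4}'
thought, payload = parse_structured(raw)
print(thought, payload)
```

`json.loads` raises `json.JSONDecodeError` if the remainder is not valid JSON, which is a convenient signal that a record should be regenerated or dropped.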


Final Empathic Reasoning Trace Generation and Evaluation

  • These columns refine and evaluate the final empathic reasoning trace.

  • The final trace is generated by reviewing the scenario, the initial empathic reasoning trace, and its evaluation.

  • The process integrates improvements suggested by the evaluation and ensures that the final reasoning is compassionate, clear, and comprehensive.

  • As before, the prompt instructs the model to wrap its internal thought process in <think>...</think> tags before providing the final JSON output.


🔁 Iteration is key – preview the dataset!

  1. Use the preview method to generate a sample of records quickly.

  2. Inspect the results for quality and format issues.

  3. Adjust column configurations, prompts, or parameters as needed.

  4. Re-run the preview until satisfied.
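
Step 2 can be partly automated. The sketch below is independent of the Data Designer SDK: it assumes the preview records have already been loaded as plain strings and tallies simple format problems, so it is easy to see whether a prompt change helped. The `find_issues` helper and the sample records are hypothetical.

```python
from collections import Counter

REQUIRED_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def find_issues(text: str) -> list[str]:
    """List format problems in one generated record (empty list means clean)."""
    if not text.strip():
        return ["empty"]
    issues = []
    for tag in REQUIRED_TAGS:
        if text.count(tag) != 1:
            issues.append(f"expected exactly one {tag}")
    return issues


# Hypothetical preview output: one clean record, one unterminated, one empty.
preview_records = [
    "<think>ok</think><answer>fine</answer>",
    "<think>unterminated<answer>fine</answer>",
    "",
]
report = Counter(issue for rec in preview_records for issue in find_issues(rec))
print(report)
```

Comparing this tally across preview runs gives a rough, repeatable signal alongside manual inspection of individual records.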


📊 Analyze the generated data

  • Data Designer automatically generates a basic statistical analysis of the generated data.

  • This analysis is available via the analysis property of generation result objects.
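
For intuition about what a basic per-column summary can contain, here is a standard-library sketch. It is not Data Designer's analysis report and does not reproduce its fields; the `summarize` helper, the chosen statistics, and the sample records are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def summarize(records: list[dict], column: str) -> dict:
    """Compute simple descriptive statistics for one column of a dataset."""
    values = [str(r[column]) for r in records if r.get(column) is not None]
    return {
        "count": len(values),
        "unique": len(set(values)),
        "most_common": Counter(values).most_common(1)[0][0] if values else None,
        "mean_chars": mean(len(v) for v in values) if values else 0.0,
    }


# Hypothetical generated records with a single categorical column.
records = [{"domain": "Family"}, {"domain": "Work"}, {"domain": "Family"}]
stats = summarize(records, "domain")
print(stats)
```

A low unique count on a sampled column, or a very short mean length on a generated one, is often the first hint that the configuration needs another iteration.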


🆙 Scale up!

  • Happy with your preview data?

  • Use the create method to submit larger Data Designer generation jobs.
