1 The Basics

🎨 NeMo Data Designer 101: The Basics

In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.

💾 Install dependencies

IMPORTANT 👉 If you haven't already, follow the instructions in the README to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

⚙️ Initialize the NeMo Data Designer Client

The Data Designer client submits generation requests to the Data Designer microservice.
In this notebook, we connect to the managed Data Designer service. Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions.

If you have an instance of Data Designer running locally, you can connect to it as follows:

data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

🏗️ Initialize the Data Designer Config Builder

The Data Designer config defines the dataset schema and generation process.
The config builder provides an intuitive interface for building this configuration.
You must provide a list of model configs to the builder at initialization.
This list contains the models you can choose from (via the model_alias argument) during the generation process.

Note: The managed NeMo Data Designer service has access to a specific set of models. Visit https://build.nvidia.com/nemo/data-designer for the latest list of available models.
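
To make the role of model_alias concrete, the sketch below builds a small alias registry using only the Python standard library. It is an illustration, not the Data Designer API: `ModelConfigSketch`, `build_alias_registry`, and the model identifiers are hypothetical names invented here, and the real builder and model config classes come from the SDK installed via the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfigSketch:
    """Hypothetical stand-in for a model config entry (not the SDK class)."""

    alias: str  # short name that columns refer to via model_alias
    model: str  # placeholder model identifier, made up for this sketch
    temperature: float = 0.7


def build_alias_registry(configs):
    """Index model configs by alias and reject duplicate aliases."""
    registry = {}
    for config in configs:
        if config.alias in registry:
            raise ValueError(f"duplicate model alias: {config.alias!r}")
        registry[config.alias] = config
    return registry


model_configs = [
    ModelConfigSketch(alias="text", model="example/text-model"),
    ModelConfigSketch(alias="code", model="example/code-model", temperature=0.2),
]
registry = build_alias_registry(model_configs)

# A column that requests model_alias="code" resolves to the second entry.
selected = registry["code"]
print(selected.model, selected.temperature)
```

Keying the list by alias is what lets each generated column pick a model by a short name instead of repeating the full model settings.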

🎲 Getting started with sampler columns

Sampler columns generate synthetic data without calling an LLM.
They are particularly useful for steering the diversity of the generated data, as we demonstrate below.

Let's start designing our product review dataset by adding product category and subcategory columns.
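
As a rough picture of what a category sampler paired with a dependent subcategory sampler does, here is a minimal standard-library sketch. The category values, the column names `product_category` and `product_subcategory`, and the function `sample_product_row` are assumptions made for illustration; they are not taken from the Data Designer SDK.

```python
import random

# Hypothetical category -> subcategory mapping, invented for illustration.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Clothing": ["Shirts", "Shoes", "Jackets"],
    "Home & Kitchen": ["Cookware", "Furniture", "Lighting"],
}


def sample_product_row(rng):
    """Draw a category, then a subcategory conditioned on that category."""
    category = rng.choice(list(SUBCATEGORIES))
    subcategory = rng.choice(SUBCATEGORIES[category])
    return {"product_category": category, "product_subcategory": subcategory}


rng = random.Random(0)  # fixed seed so repeated runs give the same rows
rows = [sample_product_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the subcategory is drawn from the list belonging to the sampled category, every row stays internally consistent while the dataset as a whole covers all categories.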

Next, let's add samplers to generate data related to the customer and their review.
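
The sketch below shows the kind of values such samplers could produce: a nested customer record, a uniformly sampled star rating, and a weighted review style. All names, value lists, and weights here (`number_of_stars`, `review_style`, `FIRST_NAMES`, and so on) are hypothetical and chosen for illustration only; the actual sampler types are provided by the SDK.

```python
import random

# Hypothetical values and weights, invented for illustration.
REVIEW_STYLES = ["rambling", "brief", "detailed"]
STYLE_WEIGHTS = [1, 2, 2]
FIRST_NAMES = ["Ana", "Bo", "Chen", "Dara"]


def sample_customer_review_fields(rng):
    """Sample a nested customer record plus review attributes."""
    return {
        "customer": {
            "first_name": rng.choice(FIRST_NAMES),
            "age": rng.randint(18, 80),  # inclusive on both ends
        },
        "number_of_stars": rng.randint(1, 5),  # uniform over 1..5
        "review_style": rng.choices(REVIEW_STYLES, weights=STYLE_WEIGHTS, k=1)[0],
    }


rng = random.Random(42)
fields = sample_customer_review_fields(rng)
print(fields)
```

Adjusting the weights is one way to steer how often each review style appears without touching any prompt text.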

🦜 LLM-generated columns

The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.
For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.
When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.
As we see below, nested JSON columns can be accessed using dot notation.
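
To illustrate how a prompt template can reference other columns, including nested fields through dot notation, here is a simplified stand-in written with the standard library. It is not Jinja and not the Data Designer renderer: it only resolves `{{ dotted.path }}` placeholders, and the record contents and the helper names `resolve` and `render` are assumptions for this sketch.

```python
import re

# Matches placeholders such as {{ customer.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(path, record):
    """Walk a dotted path like 'customer.first_name' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]  # raises KeyError if the column is missing
    return value


def render(template, record):
    """Replace every {{ dotted.path }} placeholder with the record's value."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), record)), template)


# Hypothetical record, as it might look after the sampler columns are filled.
record = {
    "product_category": "Electronics",
    "number_of_stars": 4,
    "customer": {"first_name": "Ana", "age": 31},
}
prompt = render(
    "Write a {{ number_of_stars }}-star review of a product in the "
    "{{ product_category }} category, written by {{ customer.first_name }}, "
    "age {{ customer.age }}.",
    record,
)
print(prompt)
```

Each record yields a different prompt, which is how the diversity introduced by the sampler columns carries over into the LLM-generated text.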

👀 Preview the dataset

Iteration is key to generating high-quality synthetic data.
Use the preview method to generate 10 records for inspection.
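
The idea behind previewing can be sketched as generating a small batch and checking it before committing to a full run. The code below is a standard-library illustration under that assumption; `preview` and `generate_record` here are hypothetical helpers, not the SDK's preview method, whose exact signature should be taken from the SDK documentation.

```python
import random


def generate_record(rng):
    """Hypothetical record generator standing in for the full pipeline."""
    return {
        "product_category": rng.choice(["Electronics", "Clothing"]),
        "number_of_stars": rng.randint(1, 5),
    }


def preview(generator, num_records=10, seed=0):
    """Generate a small batch so the schema can be inspected before scaling up."""
    rng = random.Random(seed)
    return [generator(rng) for _ in range(num_records)]


sample = preview(generate_record)
for index, record in enumerate(sample):
    print(index, record)

# Quick sanity check: distribution of star ratings in the preview batch.
star_counts = {}
for record in sample:
    stars = record["number_of_stars"]
    star_counts[stars] = star_counts.get(stars, 0) + 1
print(star_counts)
```

Inspecting a handful of records and a simple value count is usually enough to spot a misconfigured column or an unbalanced sampler before generating a larger dataset.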

⏭️ Next Steps

Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about: