NVIDIA
1 The Basics


🎨 NeMo Data Designer 101: The Basics

In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.

💾 Install dependencies

IMPORTANT 👉 If you haven't already, follow the instructions in the README to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

[ ]

⚙️ Initialize the NeMo Data Designer Client

  • The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.

  • In this notebook, we connect to the managed Data Designer service. Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions.

  • If you have an instance of Data Designer running locally, you can connect to it as follows:

    data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
    
[ ]
[ ]

🏗️ Initialize the Data Designer Config Builder

  • The Data Designer config defines the dataset schema and generation process.

  • The config builder provides an intuitive interface for building this configuration.

  • You must provide a list of model configs to the builder at initialization.

  • This list contains the models you can choose from (via the model_alias argument) during the generation process.

Note: The NeMo Data Designer managed service has access to a specific set of models. Please visit https://build.nvidia.com/nemo/data-designer for the current list of available models.
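
To make the alias mechanism concrete, here is a minimal sketch in plain Python. It is not the Data Designer API: `ModelConfigSketch`, `build_alias_index`, `resolve_model`, and the model ID string are hypothetical stand-ins, used only to illustrate how a list of model configs supplied at initialization might later be looked up by alias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfigSketch:
    """Hypothetical stand-in for a model config entry (not the real class)."""
    alias: str                # short name that columns refer to
    model: str                # full model identifier (placeholder value below)
    temperature: float = 0.6  # example inference parameter


def build_alias_index(configs):
    """Index configs by alias, rejecting duplicate aliases."""
    index = {}
    for cfg in configs:
        if cfg.alias in index:
            raise ValueError(f"duplicate model alias: {cfg.alias!r}")
        index[cfg.alias] = cfg
    return index


def resolve_model(index, model_alias):
    """Return the config registered under model_alias, or fail loudly."""
    try:
        return index[model_alias]
    except KeyError:
        raise KeyError(f"unknown model alias: {model_alias!r}") from None


# Hypothetical entry; the model name is a placeholder, not a real model ID.
alias_index = build_alias_index(
    [ModelConfigSketch(alias="my-text-model", model="example-org/example-model")]
)
chosen = resolve_model(alias_index, "my-text-model")
print(chosen.model)
```

Keeping the alias separate from the full model identifier means a column definition only needs the short name, so swapping the underlying model is a one-line change in the config list.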

[ ]
[ ]

🎲 Getting started with sampler columns

  • Sampler columns offer non-LLM-based generation of synthetic data.

  • They are particularly useful for steering the diversity of the generated data, as we demonstrate below.


Let's start designing our product review dataset by adding product category and subcategory columns.
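
As a rough illustration of what a category sampler paired with a dependent subcategory sampler does, here is a sketch that uses only Python's standard library. It does not call Data Designer; the category values, the `SUBCATEGORIES` mapping, and the `sample_product` helper are assumptions made up for this example.

```python
import random

# Illustrative values only; the notebook's actual categories are not shown here.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Clothing": ["Men's Clothing", "Women's Clothing", "Activewear"],
    "Books": ["Fiction", "Non-Fiction", "Children's Books"],
}


def sample_product(rng):
    """Sample a category uniformly, then a subcategory conditioned on it."""
    category = rng.choice(sorted(SUBCATEGORIES))
    subcategory = rng.choice(SUBCATEGORIES[category])
    return {"product_category": category, "product_subcategory": subcategory}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [sample_product(rng) for _ in range(5)]
for record in records:
    print(record)
```

Because the subcategory is drawn from the list that belongs to the sampled category, every record pairs a category with a valid subcategory, and the contents of the value lists directly control how diverse the generated rows are.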

[ ]

Next, let's add samplers to generate data related to the customer and their review.

[ ]

🦜 LLM-generated columns

  • The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.

  • For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.

  • When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

  • As we see below, nested JSON columns can be accessed using dot notation.
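
The idea of referencing other columns from a prompt, including nested fields via dot notation, can be sketched as follows. This is a small stand-in written with the standard library, not Jinja itself and not Data Designer code; the `render` and `lookup` helpers, the column names, and the record values are all hypothetical.

```python
import re

# Matches placeholders such as {{ product_category }} or {{ customer.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(record, dotted_path):
    """Follow a dotted path such as 'customer.first_name' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template, record):
    """Replace each {{ path }} placeholder with the value found in record."""
    return PLACEHOLDER.sub(lambda m: str(lookup(record, m.group(1))), template)


# Hypothetical record: 'customer' plays the role of a nested JSON column.
record = {
    "product_category": "Electronics",
    "customer": {"first_name": "Ada", "age": 36},
}
prompt = render(
    "Write a review of a product in the {{ product_category }} category, "
    "written by {{ customer.first_name }}, aged {{ customer.age }}.",
    record,
)
print(prompt)
```

Each placeholder is resolved per record, so the same prompt template yields a different prompt for every row of sampled data.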

[ ]

👀 Preview the dataset

  • Iteration is key to generating high-quality synthetic data.

  • Use the preview method to generate 10 records for inspection.

[ ]
[ ]
[ ]

⏭️ Next Steps

Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about: