NVIDIA Japanese Commonsense Qa Data Generator Nemotron Persona Jp Seed

🎨 NeMo Data Designer: Japanese Commonsense Reasoning Dataset Generation

📚 Overview

This notebook generates synthetic datasets for the following tasks using NeMo Data Designer:

jcommonsenseqa: Japanese commonsense question answering

Seed Data: Uses nvidia/Nemotron-Personas-Japan directly as the dataset

👋 Important – Environment Setup

Ensure that NeMo Data Designer is installed and configured

Ensure that the local LLM server is running

📦 Import Required Modules

[1]

⚙️ Initialize NeMo Data Designer Client

[2]

🎛️ Define Model Configuration

[3]

📊 Prepare Seed Data

Load persona data from nvidia/Nemotron-Personas-Japan and pass it to Data Designer as a pandas DataFrame.

[4]

Define Target Count and Category Breakdown

Define a target of 2000 seeds in total, broken down by category.

SEED_TARGET: 2000 total seeds
WeakA: geo(250), tools(100), public(200), other(150) = 700 total
WeakB (weakness reinforcement): finance(400), safety(350), vocab(350) = 1100 total
Typical: Remaining 200 seeds
Bias suppression: Max 10 per occupation, max 12 per prefecture
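
The targets can be written as a small configuration so that the Typical remainder is derived by subtraction from the per-category counts rather than hard-coded. This is a minimal sketch: `SEED_TARGET` comes from the text, while the dictionary and cap names are illustrative assumptions and not taken from the notebook's code.

```python
# Illustrative constants; the names are assumptions, the values come from the list above.
SEED_TARGET = 2000
WEAK_A_TARGETS = {"geo": 250, "tools": 100, "public": 200, "other": 150}
WEAK_B_TARGETS = {"finance": 400, "safety": 350, "vocab": 350}
MAX_PER_OCCUPATION = 10
MAX_PER_PREFECTURE = 12

weak_a_total = sum(WEAK_A_TARGETS.values())
weak_b_total = sum(WEAK_B_TARGETS.values())
# Whatever is left after the two targeted groups goes to the Typical pool.
typical_target = SEED_TARGET - weak_a_total - weak_b_total

print(weak_a_total, weak_b_total, typical_target)
```

Deriving the remainder keeps the three groups consistent if any single target is changed later.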

[5]

Handle Missing Values

Fill missing values in required columns with empty strings. If a required column does not exist, create it filled with empty strings.

Text Construction

Combine multiple columns to construct text for classification.

_all_text: Combine all columns
_core_text: Combine core columns only (primary target for keyword matching)
_core_len: Character count of core text
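
The two steps above (missing-value handling and text construction) can be sketched on a single record as follows. This is a plain-Python illustration, whereas the notebook works on a pandas DataFrame. The column lists `ALL_COLUMNS` and `CORE_COLUMNS` are assumptions, since the notebook's actual column choices are not shown, and `None` stands in for a missing value.

```python
from typing import Dict

# Hypothetical column choices; the notebook's actual lists are not shown.
ALL_COLUMNS = ["persona", "professional_persona", "travel_persona",
               "hobbies_and_interests", "skills_and_expertise"]
CORE_COLUMNS = ["persona", "professional_persona", "skills_and_expertise"]


def build_text_fields(record: Dict[str, object]) -> Dict[str, object]:
    """Fill missing columns with "" and add _all_text, _core_text, _core_len."""
    out = dict(record)
    for col in ALL_COLUMNS:
        value = out.get(col)
        # Absent keys and None both become empty strings.
        out[col] = "" if value is None else str(value)
    out["_all_text"] = " ".join(out[c] for c in ALL_COLUMNS if out[c])
    out["_core_text"] = " ".join(out[c] for c in CORE_COLUMNS if out[c])
    out["_core_len"] = len(out["_core_text"])
    return out


row = build_text_fields({"persona": "料理が好き", "travel_persona": None})
print(row["_core_text"], row["_core_len"])
```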

[6]

Create Duplicate Suppression Key

Create a key (_attr_key) for duplicate detection based on attribute combinations. This prevents selecting multiple similar personas.

Exclude completely empty keys (all fields empty).
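
One way to build such a key is sketched below in plain Python. The attribute set in `KEY_FIELDS` is an assumption (the notebook does not list which attributes it combines), and dropping duplicates at this point rather than during sampling is also an illustrative choice.

```python
from typing import Dict, Iterable, List

# Hypothetical attribute set used to build the key.
KEY_FIELDS = ["occupation", "prefecture", "age_band", "marital_status"]


def make_attr_key(record: Dict[str, str]) -> str:
    """Join the selected attributes into one comparable string."""
    return "|".join((record.get(f) or "").strip() for f in KEY_FIELDS)


def dedupe_by_attr_key(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record per key; drop records whose key fields are all empty."""
    seen = set()
    kept = []
    for rec in records:
        key = make_attr_key(rec)
        if not key.replace("|", ""):  # every field was empty
            continue
        if key in seen:  # a persona with the same attributes was already kept
            continue
        seen.add(key)
        kept.append({**rec, "_attr_key": key})
    return kept
```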

[7]

Exclude by Negative Keywords

Exclude personas containing inappropriate keywords (extreme expressions, crime-related, etc.) unsuitable for JCommonsenseQA. Evaluate _core_text and remove matching entries.
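
A minimal sketch of this filter, assuming simple substring matching on `_core_text`. The keyword list here is a two-item placeholder, not the notebook's actual list.

```python
from typing import Dict, Iterable, List

# Placeholder keywords for illustration only.
NEGATIVE_KEYWORDS = ["犯罪", "暴力"]


def has_negative_keyword(core_text: str,
                         keywords: Iterable[str] = NEGATIVE_KEYWORDS) -> bool:
    """True if any negative keyword occurs in the core text."""
    return any(kw in core_text for kw in keywords)


def drop_negative(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the records whose _core_text has no negative keyword."""
    return [r for r in records if not has_negative_keyword(r.get("_core_text", ""))]


sample = [
    {"_core_text": "週末は家族と買い物に行く"},
    {"_core_text": "犯罪小説の収集が趣味"},
]
kept = drop_negative(sample)
print(len(kept))
```

Substring matching is cheap but coarse: the second sample is removed even though it only mentions crime fiction, which may or may not be the desired behavior.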

[8]

Define Category Keyword Dictionaries

Define keywords for transportation/movement, daily life/housework, and tools.

geo_kw: A_Transportation/Movement (trains, stations, buses, walking, etc.)
life_kw: F_Daily Life/Housework (cooking, cleaning, shopping, etc.)
tools_kw: B_Tools/Usage (knives, vacuum cleaners, stationery, etc.)

Define keywords for public facilities/manners, culture/etiquette, and finance.

public_kw: D_Public Facilities/Manners (lines, order, priority seats, etc.)
culture_kw: D_Public Facilities/Manners (etiquette, ceremonies, etc.)
finance_kw: C_Payment/Money (accounting, banking, card payments, etc.)

[9]

Calculate Keyword Scores

Score how many keywords from each category are contained in the persona text.

Scores are calculated primarily from _core_text
For geo and tools only, scores are recalculated with supplementary text (travel_persona, hobbies, etc.) added
- This keeps the geo and tools pools from running dry and reduces misclassification as public
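
The scoring rule can be sketched as counting how many distinct keywords of a category occur in the text. The keyword lists below are short placeholders based on the examples given earlier, and the function names are illustrative rather than the notebook's own.

```python
from typing import Dict, Iterable

# Placeholder keyword lists; the notebook's dictionaries are larger.
KEYWORDS = {
    "geo": ["電車", "駅", "バス", "徒歩"],
    "tools": ["包丁", "掃除機", "文房具"],
    "finance": ["会計", "銀行", "カード"],
}
# Categories whose score also looks at supplementary text.
SUPPLEMENT_FOR = {"geo", "tools"}


def keyword_score(text: str, keywords: Iterable[str]) -> int:
    """Number of distinct keywords that occur in the text."""
    return sum(1 for kw in keywords if kw in text)


def score_record(core_text: str, supplementary_text: str = "") -> Dict[str, int]:
    """Score every category; geo and tools also see the supplementary text."""
    scores = {}
    for category, kws in KEYWORDS.items():
        text = core_text
        if category in SUPPLEMENT_FOR:
            text = core_text + " " + supplementary_text
        scores["score_" + category] = keyword_score(text, kws)
    return scores


scores = score_record("銀行でカードを作った", "休日は電車で駅めぐり")
print(scores)
```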

Exclude Abnormal Scores and Estimate Categories

Exclude data with abnormally high scores (containing unnaturally many keywords) and estimate the most suitable category for each persona.

Select the category with the highest score
In case of ties, decide by priority (finance > safety > vocab > ...)
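
A sketch of the selection rule under stated assumptions: only the first three entries of the priority order (finance, safety, vocab) are given in the text, so the rest of `PRIORITY` is an assumed ordering, and the abnormal-score threshold `max_allowed` is a hypothetical value.

```python
from typing import Dict, Optional

# finance > safety > vocab comes from the text; the remaining order is assumed.
PRIORITY = ["finance", "safety", "vocab", "public", "geo", "tools", "culture", "life"]


def estimate_category(scores: Dict[str, int], max_allowed: int = 8) -> Optional[str]:
    """Return the highest-scoring category, breaking ties by PRIORITY.

    Returns None when nothing matched or when the top score is abnormally high
    (max_allowed is a hypothetical threshold).
    """
    best = max(scores.values(), default=0)
    if best == 0 or best > max_allowed:
        return None
    for category in PRIORITY:
        if scores.get(category, 0) == best:
            return category
    return None


print(estimate_category({"finance": 2, "safety": 2, "geo": 1}))
```

Walking the priority list and returning the first category that holds the top score resolves ties without sorting.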

[10]

Determine Neutral Data

Classify short personas with no keyword hits as Neutral.

Conditions:

Core text length is 260 characters or less
Keyword hit count is 0
Does not contain definition keywords (such as '~とは')

Limit Neutral to 50 entries to prevent too many thin seeds.
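
The three conditions and the cap can be sketched as below. The thresholds (260 characters, 0 hits, 50 entries) come from the text. The marker list contains only the 'とは' example, and taking the first 50 matches is an assumption about how the cap is applied.

```python
from typing import Dict, List

DEFINITION_MARKERS = ["とは"]  # only the example from the text
NEUTRAL_MAX_LEN = 260
NEUTRAL_CAP = 50


def is_neutral(core_text: str, kw_hits: int) -> bool:
    """Short text, no keyword hits, and no definition-style phrasing."""
    has_definition = any(marker in core_text for marker in DEFINITION_MARKERS)
    return len(core_text) <= NEUTRAL_MAX_LEN and kw_hits == 0 and not has_definition


def select_neutral(records: List[Dict[str, object]], cap: int = NEUTRAL_CAP) -> List[Dict[str, object]]:
    """Keep at most `cap` neutral records (first come, first kept)."""
    neutral = [r for r in records if is_neutral(str(r["_core_text"]), int(r["_kw_hits"]))]
    return neutral[:cap]
```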

Create Sampling Pools

Create sampling pools for each category.

typical_pool: Neutral and thin data (max_score ≤ 2)
weakB_pool: Reinforcement targets (finance, safety, vocab)
geo_pool, tools_pool, public_pool, other_pool: Each sub-category of WeakA

[11]

Define Sampling Function with Caps

Function that samples while suppressing bias by occupation and prefecture.

Operation:

First sample while respecting caps
If insufficient, relax caps to fill the remainder
Always ensure the specified count is met
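
The two-pass behaviour described above can be sketched in plain Python. The function name, signature, and fixed seed are assumptions, and the notebook's own function operates on a DataFrame. Note that the requested count can only be met when the pool holds at least that many records.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_with_caps(pool: List[Dict[str, str]], n: int,
                     max_per_occupation: int = 10,
                     max_per_prefecture: int = 12,
                     seed: int = 42) -> List[Dict[str, str]]:
    """Sample up to n records, honoring the caps first and relaxing them if short."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)

    occ_count, pref_count = Counter(), Counter()
    picked, skipped = [], []
    # Pass 1: respect both caps.
    for rec in shuffled:
        if len(picked) >= n:
            break
        occ, pref = rec["occupation"], rec["prefecture"]
        if occ_count[occ] < max_per_occupation and pref_count[pref] < max_per_prefecture:
            picked.append(rec)
            occ_count[occ] += 1
            pref_count[pref] += 1
        else:
            skipped.append(rec)
    # Pass 2: caps relaxed, fill the shortfall from records skipped in pass 1.
    shortfall = n - len(picked)
    if shortfall > 0:
        picked.extend(skipped[:shortfall])
    return picked
```

Using a seeded `random.Random` instance keeps the selection reproducible across runs.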

[12]

Sample WeakB Categories

Sample weakness reinforcement targets (WeakB).

finance: 400 entries
safety: 350 entries
vocab: 350 entries

Sample a total of 1100 entries, excluding already selected data from subsequent sampling.

WeakA - Sample Geo/Tools

Sample transportation/movement and tools categories from WeakA.

geo (Transportation/Movement): 250 entries
tools (Tools): 100 entries

WeakA - Sample Public/Other

Sample the remainder of WeakA.

public (Public Facilities/Manners): 200 entries
other (culture/life): 150 entries
- For Other, prioritize those with public facility-related keywords
- Suppress those with religion-related keywords (penalty)
- This prevents category D from being biased toward religion

Sample Typical and Final Adjustments

Fill the remaining slots from the Typical category (approx. 200 entries: 2000 minus the 1100 WeakB and 700 WeakA entries sampled above).

Process:

Sample remaining count from Typical pool
Combine all parts
If insufficient, add from unused data
If excess, adjust to 2000 entries
Always ensure exactly 2000 entries
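
The combine, top-up, and trim steps can be sketched as below. The function name and the use of `uuid` as the uniqueness key are assumptions made for the illustration.

```python
from typing import Dict, List


def finalize(parts: List[List[Dict[str, str]]],
             unused: List[Dict[str, str]],
             target: int = 2000) -> List[Dict[str, str]]:
    """Combine sampled parts, top up from unused records, then trim to target."""
    combined, seen = [], set()
    for part in parts:
        for rec in part:
            if rec["uuid"] not in seen:  # skip records already selected
                seen.add(rec["uuid"])
                combined.append(rec)
    # Top up if the parts did not reach the target.
    for rec in unused:
        if len(combined) >= target:
            break
        if rec["uuid"] not in seen:
            seen.add(rec["uuid"])
            combined.append(rec)
    # Trim if the parts overshot the target.
    return combined[:target]
```

As with the capped sampler, the exact target is only reachable when the parts and the unused records together contain enough distinct rows.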

[13]

Assign Themes and Check Distribution

Map categories to JCommonsenseQA themes (A-F, N) and check the final distribution.

Themes:

A: Transportation/Movement
B: Tools/Usage
C: Payment/Money
D: Public Facilities/Manners
E: Safety/Danger
F: Daily Life/Housework
N: Neutral

[14]

[seed_jc] size: 2000
jc_category
finance    407
vocab      364
safety     352
geo        320
public     208
culture    145
tools      102
life       102
Name: count, dtype: int64
jc_theme
B_道具・用途         466
C_支払い・お金        407
D_公共施設・マナー手順    353
E_安全・危険         352
A_交通・移動         320
F_生活・家事         102
Name: count, dtype: int64
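
The two count tables above are consistent with the category-to-theme mapping sketched below. The mapping is inferred from the totals (for example, tools 102 + vocab 364 = 466 for theme B) rather than copied from the notebook's code, so treat it as an assumption. No Neutral (N) rows appear in this output, so N is omitted.

```python
from collections import Counter

# Mapping inferred from the printed totals; not taken from the notebook's code.
CATEGORY_TO_THEME = {
    "geo": "A_交通・移動",
    "tools": "B_道具・用途",
    "vocab": "B_道具・用途",
    "finance": "C_支払い・お金",
    "public": "D_公共施設・マナー手順",
    "culture": "D_公共施設・マナー手順",
    "safety": "E_安全・危険",
    "life": "F_生活・家事",
}

# jc_category counts as printed above.
category_counts = {
    "finance": 407, "vocab": 364, "safety": 352, "geo": 320,
    "public": 208, "culture": 145, "tools": 102, "life": 102,
}

theme_counts = Counter()
for category, count in category_counts.items():
    theme_counts[CATEGORY_TO_THEME[category]] += count

for theme, count in theme_counts.most_common():
    print(theme, count)
```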

Create and Save Final Output Data

Select columns needed for prompt generation and create the final seed data.

Output Columns:

uuid, occupation, prefecture, region, marital_status
age_band, skills_and_expertise_list
jc_theme, jc_category, _attr_key

[15]

Theme-Based Topic Category Assignment

This code probabilistically assigns topic_category to each seed persona based on JCommonsenseQA themes (A-F, N).

Main Features

Define weights for each theme
- Define assignment probabilities for topic categories (transportation, public places, daily life, etc.) for each theme (transportation/movement, tools/usage, etc.)
- Example: A_Transportation/Movement → Transportation 60%, Public places 25%, Daily life 15%
Deterministic random number generation
- stable_u01(): Always generates the same 0-1 value from UUID/attribute key (using MD5 hash)
- Guarantees complete reproducibility as it always returns the same result for the same input
Weighted probability selection
- pick_weighted(): Selects categories based on weights using cumulative probability method
- Determines appropriate topic category based on 0-1 random value
Automatic key selection
- Uses uuid column if it exists, otherwise uses _attr_key
- Identifier for assigning unique stable random numbers to each row

Process Flow

Row data → Get jc_theme → Get weights for theme 
→ Generate hash value from UUID (0-1) → Weighted selection → Assign topic_category

Execution Result

A topic_category column is added to each seed persona, assigning topic categories (transportation, public places, daily life, shopping, school, meals, workplace) according to the theme.
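
The process flow can be sketched as follows. The function names `stable_u01` and `pick_weighted` come from the description above, but their bodies here are assumptions: in particular, scaling the first 32 bits of the MD5 digest to [0, 1) is one possible construction. The English topic labels are placeholders for the notebook's actual values, and the weights are the 60/25/15 example for theme A.

```python
import hashlib
from typing import Dict

# Example weights for theme A from the text; the labels are placeholders.
THEME_WEIGHTS = {
    "A_交通・移動": {"transportation": 0.60, "public_places": 0.25, "daily_life": 0.15},
}


def stable_u01(key: str) -> float:
    """Map a key to a reproducible value in [0, 1) via its MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32); scale it to [0, 1).
    return int(digest[:8], 16) / 2**32


def pick_weighted(weights: Dict[str, float], u: float) -> str:
    """Cumulative-probability selection: return the first bucket that contains u."""
    total = sum(weights.values())
    cumulative = 0.0
    last = ""
    for name, weight in weights.items():
        cumulative += weight / total
        last = name
        if u < cumulative:
            return name
    return last  # guards against float rounding when u is very close to 1


def assign_topic(row: Dict[str, str]) -> str:
    """Use uuid when present, otherwise _attr_key, as the stable random key."""
    key = row.get("uuid") or row["_attr_key"]
    return pick_weighted(THEME_WEIGHTS[row["jc_theme"]], stable_u01(key))
```

Because the value depends only on the key, rerunning the assignment or reordering the rows yields the same `topic_category` for each persona.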

[16]

CSV Text Normalization and Unterminated Quote Detection

Clean and normalize dataframe text in advance to prevent issues with unterminated quotes and control characters during CSV file output.

Overview

This code combines 4 normalization processes to clean up data:

Unify newline codes - Convert CRLF/CR to LF for cross-platform compatibility
Remove control characters - Delete invisible characters that hinder CSV parsing (preserve tabs and newlines)
Unicode safety - Handle corrupted characters and isolated surrogates
Quote escaping - Double quotes as needed (usually handled automatically by csv.writer)

Apply the clean_cell() function to all cells in all columns to convert the entire dataframe to a safe state.

Unterminated Quote Detection

The is_potential_unterminated_quote() function warns when the number of double quote occurrences in each row is odd. While not a complete detection, it functions as an inexpensive primary screening before CSV writing and can detect potential syntax errors early.
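
A minimal sketch of the two helpers named above, assuming every cell is converted to a string. The bodies are illustrative: quote escaping is left to the CSV writer, and replacing isolated surrogates with "?" is one possible way to handle them.

```python
import re
from typing import Iterable

# Control characters other than tab (\x09) and newline (\x0a).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_cell(value: object) -> str:
    """Normalize one cell: newlines, control characters, broken Unicode."""
    if value is None:
        return ""
    text = str(value)
    # 1. Unify newline codes (CRLF and CR become LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 2. Remove control characters, preserving tabs and newlines.
    text = _CONTROL_CHARS.sub("", text)
    # 3. Isolated surrogates cannot be encoded as UTF-8; replace them with "?".
    text = text.encode("utf-8", errors="replace").decode("utf-8")
    return text


def is_potential_unterminated_quote(row_values: Iterable[object]) -> bool:
    """True when the row contains an odd number of double quotes."""
    joined = "".join(str(v) for v in row_values)
    return joined.count('"') % 2 == 1
```

The odd-count check flags a row such as `['say "hi', 'ok']` but cannot catch two unbalanced quotes in different cells, which is why it serves only as a first screening.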

[17]

Filter Unterminated Quotes and Output CSV

Detect and exclude problematic rows with potential unterminated quotes, outputting only clean seed data to CSV file. Apply the is_potential_unterminated_quote() function to all rows to identify suspicious rows (odd number of double quotes), and extract only safe rows by logical inversion of the mask. Then remove judgment flag columns used in intermediate processing (is_neutral, _has_definition, has_neg_jc) to clean up the data, and save it in CSV format as the final seed data. This prevents CSV syntax errors and parsing failures, ensuring safety in downstream processing.

Note: The variable name is suspects, but it actually contains clean data after excluding suspicious rows.
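
The filter-and-write step can be sketched with the standard `csv` module, writing to an in-memory buffer instead of a file. The function name and return shape are assumptions. The dropped flag columns are the ones named above.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple

FLAG_COLUMNS = ("is_neutral", "_has_definition", "has_neg_jc")


def write_clean_csv(rows: List[Dict[str, object]],
                    fieldnames: Sequence[str],
                    drop_columns: Sequence[str] = FLAG_COLUMNS) -> Tuple[str, int]:
    """Write rows without suspicious quotes; return the CSV text and kept count."""
    out_fields = [f for f in fieldnames if f not in drop_columns]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=out_fields, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in rows:
        values = [str(row.get(f, "")) for f in out_fields]
        # Skip rows with an odd number of double quotes (possible unterminated quote).
        if "".join(values).count('"') % 2 == 1:
            continue
        writer.writerow({f: row.get(f, "") for f in out_fields})
        kept += 1
    return buffer.getvalue(), kept


rows = [
    {"uuid": "1", "text": 'He said "hi"', "is_neutral": False},
    {"uuid": "2", "text": 'broken " quote', "is_neutral": True},
]
csv_text, kept = write_clean_csv(rows, ["uuid", "text", "is_neutral"])
print(kept)
```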

[18]

🏗️ Define Data Structures

Data Structure for jcommonsenseqa
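
A sketch of what such a structure might look like, written as a standard-library dataclass. The field names `answer_index` and `choice0` appear in the preview output further down, while the remaining fields are assumed by analogy with JCommonsenseQA's five-choice format. The notebook's actual definition is not shown and may use a different schema library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JCommonsenseQAItem:
    """One five-choice question (field names partly assumed)."""
    question: str
    choice0: str
    choice1: str
    choice2: str
    choice3: str
    choice4: str
    answer_index: int

    def __post_init__(self) -> None:
        # The answer must point at one of the five choices.
        if not 0 <= self.answer_index <= 4:
            raise ValueError("answer_index must be between 0 and 4")

    def choices(self) -> List[str]:
        return [self.choice0, self.choice1, self.choice2, self.choice3, self.choice4]

    def answer(self) -> str:
        return self.choices()[self.answer_index]


# Made-up sample item for illustration only.
item = JCommonsenseQAItem(
    question="お店で買い物をしたとき、レジで行うことは?",
    choice0="現金で支払う", choice1="歌を歌う", choice2="昼寝をする",
    choice3="絵を描く", choice4="泳ぐ", answer_index=0,
)
print(item.answer())
```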

[19]

📝 Configuration 1: Using Seed Data

Uses the persona dataset directly as seed, and Data Designer automatically samples columns from the dataset.

[20]

[21]

[15:03:51] [INFO] 🔄 Uploading seed dataset to datastore

Upload 0 LFS files: 0it [00:00, ?it/s]

[22]

Add jcommonsenseqa Generated Columns

Seed data columns can be directly referenced in the format {{ column_name }}.

[23]

Added jcommonsenseqa generated columns

🔍 Quality Evaluation Setup

Evaluate data quality using LLM-as-a-Judge

[24]

Added quality evaluation columns

🔁 Generate Preview

First, check quality on a small amount of data

[25]

[15:03:56] [INFO] ✅ Validation passed
[15:03:56] [INFO] 🚀 Starting preview generation
[15:03:56] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[15:03:56] [INFO] 🩺 Running health checks for models...



Generating preview for the seed-data version...
==========

[15:03:57] [INFO]   |-- 👀 Checking 'openai/gpt-oss-120b' in provider named 'nvidiabuild' for model alias 'gpt-oss-120b'...
[15:03:57] [INFO]   |-- ✅ Passed!
[15:03:58] [INFO]   |-- 👀 Checking 'openai/gpt-oss-120b' in provider named 'nvidiabuild' for model alias 'quality-judge'...
[15:03:58] [INFO]   |-- ✅ Passed!
[15:03:58] [INFO] ⏳ Processing batch 1 of 1
[15:03:58] [INFO] 🌱 Sampling 1 records from seed dataset
[15:03:58] [INFO]   |-- seed dataset size: 2000 records
[15:03:58] [INFO]   |-- sampling strategy: ordered
[15:03:58] [INFO] 🗂️ Preparing llm-structured column generation
[15:03:58] [INFO]   |-- column name: 'jcqa_data'
[15:03:58] [INFO]   |-- model config:
{
    "alias": "gpt-oss-120b",
    "model": "openai/gpt-oss-120b",
    "inference_parameters": {
        "temperature": 0.9,
        "top_p": 0.95,
        "max_tokens": 2048,
        "max_parallel_requests": 8,
        "timeout": 1200,
        "extra_body": null
    },
    "provider": "nvidiabuild"
}
[15:04:03] [INFO] 🐙 Processing llm-structured column 'jcqa_data' with 8 concurrent workers
[15:04:05] [INFO] ⚖️ Preparing llm-judge column generation
[15:04:05] [INFO]   |-- column name: 'quality_metrics'
[15:04:05] [INFO]   |-- model config:
{
    "alias": "quality-judge",
    "model": "openai/gpt-oss-120b",
    "inference_parameters": {
        "temperature": 0.3,
        "top_p": 0.9,
        "max_tokens": 1024,
        "max_parallel_requests": 4,
        "timeout": 1500,
        "extra_body": null
    },
    "provider": "nvidiabuild"
}
[15:04:10] [INFO] 🐙 Processing llm-judge column 'quality_metrics' with 4 concurrent workers
[15:04:10] [INFO] 🧩 Generating column `clarity_score` from expression
[15:04:10] [INFO] 🧩 Generating column `difficulty` from expression
[15:04:10] [INFO] 📊 Model usage summary:
{
    "openai/gpt-oss-120b": {
        "token_usage": {
            "prompt_tokens": 1653,
            "completion_tokens": 563,
            "total_tokens": 2216
        },
        "request_usage": {
            "successful_requests": 1,
            "failed_requests": 0,
            "total_requests": 1
        },
        "tokens_per_second": 187,
        "requests_per_minute": 5
    }
}
[15:04:10] [INFO] 📐 Measuring dataset column statistics:
[15:04:10] [INFO]   |-- 🌱 column: 'uuid'
[15:04:10] [INFO]   |-- 🌱 column: 'professional_persona'
[15:04:10] [INFO]   |-- 🌱 column: 'sports_persona'
[15:04:10] [INFO]   |-- 🌱 column: 'arts_persona'
[15:04:10] [INFO]   |-- 🌱 column: 'travel_persona'
[15:04:10] [INFO]   |-- 🌱 column: 'culinary_persona'
[15:04:10] [INFO]   |-- 🌱 column: 'persona'
[15:04:10] [INFO]   |-- 🌱 column: 'cultural_background'
[15:04:10] [INFO]   |-- 🌱 column: 'skills_and_expertise'
[15:04:10] [INFO]   |-- 🌱 column: 'skills_and_expertise_list'
[15:04:10] [INFO]   |-- 🌱 column: 'hobbies_and_interests'
[15:04:10] [INFO]   |-- 🌱 column: 'hobbies_and_interests_list'
[15:04:10] [INFO]   |-- 🌱 column: 'career_goals_and_ambitions'
[15:04:10] [INFO]   |-- 🌱 column: 'sex'
[15:04:10] [INFO]   |-- 🌱 column: 'age'
[15:04:10] [INFO]   |-- 🌱 column: 'marital_status'
[15:04:10] [INFO]   |-- 🌱 column: 'education_level'
[15:04:10] [INFO]   |-- 🌱 column: 'occupation'
[15:04:10] [INFO]   |-- 🌱 column: 'region'
[15:04:10] [INFO]   |-- 🌱 column: 'area'
[15:04:10] [INFO]   |-- 🌱 column: 'prefecture'
[15:04:10] [INFO]   |-- 🌱 column: 'country'
[15:04:10] [INFO]   |-- 🌱 column: 'age_band'
[15:04:10] [INFO]   |-- 🌱 column: '_all_text'
[15:04:10] [INFO]   |-- 🌱 column: '_core_text'
[15:04:10] [INFO]   |-- 🌱 column: '_core_len'
[15:04:10] [INFO]   |-- 🌱 column: '_attr_key'
[15:04:10] [INFO]   |-- 🌱 column: 'score_finance'
[15:04:10] [INFO]   |-- 🌱 column: 'score_safety'
[15:04:10] [INFO]   |-- 🌱 column: 'score_vocab'
[15:04:10] [INFO]   |-- 🌱 column: 'score_public'
[15:04:10] [INFO]   |-- 🌱 column: 'score_tools'
[15:04:10] [INFO]   |-- 🌱 column: 'score_life'
[15:04:10] [INFO]   |-- 🌱 column: 'score_geo'
[15:04:10] [INFO]   |-- 🌱 column: 'score_culture'
[15:04:10] [INFO]   |-- 🌱 column: '_geo_text'
[15:04:10] [INFO]   |-- 🌱 column: '_tools_text'
[15:04:10] [INFO]   |-- 🌱 column: '_kw_hits'
[15:04:10] [INFO]   |-- 🌱 column: 'jc_category'
[15:04:10] [INFO]   |-- 🌱 column: 'max_score_any'
[15:04:10] [INFO]   |-- 🌱 column: '_public_bonus'
[15:04:10] [INFO]   |-- 🌱 column: '_religion_pen'
[15:04:10] [INFO]   |-- 🌱 column: 'jc_theme'
[15:04:10] [INFO]   |-- 🌱 column: 'topic_category'
[15:04:10] [INFO]   |-- 🗂️ column: 'jcqa_data'
[15:04:10] [INFO]   |-- ⚖️ column: 'quality_metrics'
[15:04:10] [INFO]   |-- 🧩 column: 'clarity_score'
[15:04:10] [INFO]   |-- 🧩 column: 'difficulty'
[15:04:10] [INFO] ✅ Preview complete!


Preview generation complete!

[26]


Preview data analysis:

[27]


First few rows of the preview data:
                               uuid  \
0  749db6e7c2e245b2ae3b46aa12c4f1e0   

                                professional_persona  \
0  中嶋 仁子は保険契約のリスク評価と顧客ニーズの体系的分析に長年従事し、退職後もメンタリングと...   

                                      sports_persona  \
0  中嶋 仁子は季節に合わせたウォーキングとコミュニティの軽運動クラスで体力維持を図り、競争的な...   

                                        arts_persona  \
0  中嶋 仁子は茶道と書道の伝統的稽古を基盤に、デジタル墨絵やインタラクティブ茶室体験といった非...   

                                      travel_persona  \
0  中嶋 仁子は近郊の歴史的寺院や季節の農産物直売所への日帰り訪問を計画し、列車予約と宿泊先のロ...   

                                    culinary_persona  \
0  中嶋 仁子は季節の根菜と海藻を使用した低塩和食を好み、抹茶と煎茶の抽出時間を微調整しながら、...   

                                             persona  \
0  中嶋 仁子は組織的なリスク管理と健康志向の生活習慣を統合し、オープンマインドと計画性で高齢者...   

                                 cultural_background  \
0  三重県出身で近畿地方特有の温かい人情と、年長者への敬意を重んじる価値観を持ち、真面目さと健康...   

                                skills_and_expertise  \
0  保険契約の管理と更新、リスク評価・顧客のニーズ把握、法規制の遵守に加えて、ExcelやWor...   

                         skills_and_expertise_list  ... _public_bonus  \
0  ['保険契約管理', 'リスク評価', '顧客対応', '法規制遵守', 'Excel業務']  ...          None   

  _religion_pen  jc_theme topic_category  \
0          None  C_支払い・お金           公共の場   

                                           jcqa_data  \
0  {'answer_index': 0, 'choice0': '現金で支払う', 'choi...   

                          jcqa_data__reasoning_trace  \
0  We need to output JSON with fields: question, ...   

                                     quality_metrics  \
0  {'difficulty': {'reasoning': '日本の駅窓口での支払い方法は広く...   

                    quality_metrics__reasoning_trace clarity_score difficulty  
0  We need to evaluate the generated data's quali...            明確        易しい  

[1 rows x 50 columns]

🆙 Generate Production Data

If the preview shows no issues, generate the large-scale dataset

[ ]

📊 Analyze Results

[ ]

📈 Quality Comparison

Compare quality with and without seed data

[ ]

💾 Save jcommonsenseqa Data

[ ]

📋 Summary

What We Did

✅ Correctly configured nvidia/Nemotron-Personas-Japan as seed data
✅ Generated data by directly referencing seed data columns
✅ Generated synthetic data for jcommonsenseqa
✅ Created with seed data
✅ Quality evaluation using LLM-as-a-Judge
✅ Generated quality comparison report

[ ]