NVIDIA Voice RAG Agent Tutorial

alph-notebooks/nvidia-nemotron / voice_rag_agent_tutorial.ipynb

Building a Voice-Powered RAG Agent with NVIDIA Nemotron Models

This notebook walks you through building an end-to-end AI agent that combines voice input, multimodal retrieval, safety guardrails, and long-context reasoning using NVIDIA's Nemotron model family.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        VOICE-POWERED RAG AGENT                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│   │  Voice  │───>│   ASR   │───>│   RAG   │───>│   LLM   │───>│ Safety  │   │
│   │  Input  │    │ (NeMo)  │    │ Embed+  │    │ Reason  │    │  Guard  │   │
│   └─────────┘    └─────────┘    │ Rerank  │    └─────────┘    └─────────┘   │
│                                 └─────────┘                         │       │
│                                      │                              │       │
│                                      v                              v       │
│                                 ┌─────────┐                   ┌─────────┐   │
│                                 │  FAISS  │                   │  Safe   │   │
│                                 │  Index  │                   │ Response│   │
│                                 └─────────┘                   └─────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Models Used

Component	Model	Deployment
Speech-to-Text	`nemotron-speech-streaming-en-0.6b`	Self-hosted (NeMo)
Embeddings	`llama-nemotron-embed-vl-1b-v2`	Self-hosted (Transformers)
Reranking	`llama-nemotron-rerank-vl-1b-v2`	Self-hosted (Transformers)
Vision-Language	`nemotron-nano-12b-v2-vl`	NVIDIA API
Reasoning	`nemotron-3-nano-30b-a3b`	NVIDIA API
Safety	`Llama-3.1-Nemotron-Safety-Guard-8B-v3`	Self-hosted (Transformers)

Prerequisites

NVIDIA GPU with 24GB+ VRAM (for self-hosted models)
NVIDIA API key (for cloud-hosted reasoning models)
Python 3.10+

Step 1: Environment Setup

Before we begin, we need to install the required dependencies and configure API access.

What gets installed:

LangChain v1.0: Modern agent orchestration with create_agent API
langchain-nvidia-ai-endpoints: ChatNVIDIA integration for NVIDIA API
Transformers + PyTorch: For running local embedding, reranking, and safety models
FAISS: Vector similarity search
NeMo Toolkit: NVIDIA's ASR framework
ipywebrtc: Audio recording widget for Jupyter

# Install core dependencies (LangChain v1.0+)
pip install langchain langchain-nvidia-ai-endpoints faiss-cpu transformers torch pillow requests jinja2 soundfile

# Install NeMo for ASR
pip install "nemo_toolkit[asr]"

# Install audio recording widget
pip install ipywebrtc
jupyter nbextension enable --py widgetsnbextension

[1]

✅ Environment configured successfully!
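
The code for this setup cell is not included in the export. A minimal sketch of the API-access part might look like the following, assuming the key is supplied through the `NVIDIA_API_KEY` environment variable (the name `langchain-nvidia-ai-endpoints` looks for). The helper name `configure_api_key` is hypothetical.

```python
import getpass
import os


def configure_api_key(env=os.environ, prompt=getpass.getpass):
    """Ensure NVIDIA_API_KEY is set, prompting once if it is missing."""
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key:
        # Ask interactively without echoing the key to the notebook output.
        key = prompt("Enter your NVIDIA API key: ").strip()
    if not key:
        raise ValueError("An NVIDIA API key is required for the hosted models.")
    env["NVIDIA_API_KEY"] = key
    return key
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment.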

[2]

✅ GPU available: NVIDIA RTX 6000 Ada Generation
   Memory: 47.6 GB
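
The GPU check itself is also missing from the export. One way to compare the detected device against the 24GB+ prerequisite is sketched below; the function names are hypothetical, and the exact figure printed depends on how the driver reports memory.

```python
def meets_vram_requirement(total_bytes, min_gb=24.0):
    """Return True when the reported memory covers the 24GB+ prerequisite."""
    return total_bytes / 1024**3 >= min_gb


def describe_gpu():
    """Report the first CUDA device, or explain why none is usable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    props = torch.cuda.get_device_properties(0)
    verdict = "OK" if meets_vram_requirement(props.total_memory) else "below 24 GB"
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GB ({verdict})"
```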

Step 2: Ground the Agent with Multimodal RAG

Retrieval-Augmented Generation (RAG) grounds our agent in real data, reducing hallucinations by giving the model factual context for every response.

┌─────────────────────────────────────────────────────────────────┐
│                    MULTIMODAL RAG PIPELINE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   INDEXING (Offline)                                            │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │  Text    │───>│  Embed   │───>│  FAISS   │                  │
│   │  Docs    │    │  Model   │    │  Index   │                  │
│   └──────────┘    └──────────┘    └──────────┘                  │
│   ┌──────────┐         │                                        │
│   │  Images  │─────────┘                                        │
│   └──────────┘                                                  │
│                                                                 │
│   RETRIEVAL (Online)                                            │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│   │  Query   │───>│  Embed   │───>│  Search  │───>│  Rerank  │  │
│   │          │    │  Query   │    │  Top-K   │    │  Top-N   │  │
│   └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2.1 Load the Embedding Model

The llama-nemotron-embed-vl-1b-v2 model creates semantic vector representations of both text and images. It supports three input modes:

Text-only embedding: Standard document search
Image-only embedding: Search over screenshots, diagrams, slides
Image+Text pairs: Maximum retrieval accuracy for rich documents

The model uses different context lengths for each mode to optimize quality.

[3]

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.

✅ Logged into HuggingFace!
Loading embedding model: nvidia/llama-nemotron-embed-vl-1b-v2...
✅ Embedding model loaded!

2.2 Load the Reranking Model

Initial retrieval casts a wide net using fast vector similarity. The reranker then performs deeper query-document interaction to surface the most relevant results.

Why rerank? Embedding-based retrieval is fast but approximate. The reranker reads each candidate document alongside the query, enabling cross-attention between them. This improves accuracy by ~6-7% on benchmarks.

The llama-nemotron-rerank-vl-1b-v2 model handles both text and image documents, using the same multimodal architecture as the embedding model.
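
The reranking step can be pictured as "score every query-document pair, then keep the best few". The sketch below shows only that control flow. `word_overlap` is a toy stand-in for the real model's relevance score, and both function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for the reranker: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)


candidates = ["robots learn in simulation", "genomics models read DNA", "robots and AI"]
print(rerank("AI robots", candidates, word_overlap, top_n=2))
# → [('robots and AI', 1.0), ('robots learn in simulation', 0.5)]
```

In the notebook, `score_pair` would be replaced by a call into the reranking model, which sees the query and the candidate together instead of comparing two independently computed vectors.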

[4]

Loading reranking model: nvidia/llama-nemotron-rerank-vl-1b-v2...
✅ Reranking model loaded!

2.3 Build a Sample Knowledge Base

Let's create a small knowledge base with both text and images to demonstrate multimodal retrieval. In production, you would index your actual documents, PDFs, and images here.

Sample topics:

NVIDIA Isaac Lab (robotics)
Autonomous vehicles (NVIDIA DRIVE)
Nemotron 3 Nano architecture
Genomics research (Evo-2)
RAG fundamentals

[5]

✅ Loaded 5 documents
   - With images: 3
   - Text only: 2
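
The export does not show how the documents are represented. One plausible shape, given that each entry has text and may have an image, is a small dataclass. The class and the `image_path` field are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """One knowledge-base entry: text plus an optional image reference."""
    text: str
    image_path: Optional[str] = None  # hypothetical field; None means text only

    @property
    def has_image(self) -> bool:
        return self.image_path is not None


def summarize(docs: List[Document]) -> str:
    """Produce a load summary in the same layout as the output above."""
    with_images = sum(doc.has_image for doc in docs)
    return (
        f"Loaded {len(docs)} documents\n"
        f"   - With images: {with_images}\n"
        f"   - Text only: {len(docs) - with_images}"
    )
```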

[6]

Creating document embeddings...
✅ Created embeddings with shape: (5, 2048)

[7]

✅ FAISS index created with 5 vectors
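
The export does not show which FAISS index type is used. A common choice for a small collection is an exact inner-product index (`faiss.IndexFlatIP`) over L2-normalized vectors, which amounts to cosine-similarity search. Under that assumption, the pure-Python sketch below shows what such an index computes; it illustrates the math and is not a replacement for FAISS.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(
    index: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Exhaustive inner-product search: return (row, score) for the k best rows."""
    q = normalize(query)
    scores = [
        (row, sum(a * b for a, b in zip(normalize(vec), q)))
        for row, vec in enumerate(index)
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]
```

With five 2048-dimensional vectors, as in this notebook, exhaustive search is effectively instantaneous; approximate index types only start to matter for much larger collections.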

[8]

Query: How is AI used in robotics?

Top 2 results:
  1. NVIDIA Isaac Lab is a unified framework for robot learning built on Isaac Sim. It provides modular c...
     Has image: True
  2. Autonomous vehicles use AI for perception, planning, and control. NVIDIA DRIVE provides the compute ...
     Has image: True

Step 3: Add Real-Time Speech with Nemotron Speech ASR

Now we add voice input capability. The nemotron-speech-streaming-en-0.6b model converts spoken audio to text in a streaming fashion, with latency that is configurable through the chunk size.

┌─────────────────────────────────────────────────────────────┐
│                    ASR PIPELINE                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│   │  Audio   │───>│  NeMo    │───>│  Text    │              │
│   │  Stream  │    │  ASR     │    │  Output  │              │
│   └──────────┘    └──────────┘    └──────────┘              │
│                        │                                    │
│                        v                                    │
│              ┌─────────────────┐                            │
│              │ + Punctuation   │                            │
│              │ + Capitalization│                            │
│              └─────────────────┘                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key features:

Trained on 285k-hour Granary dataset
7.16% average WER on Open ASR Leaderboard
Cache-aware streaming for real-time applications
Built-in punctuation and capitalization
Configurable latency: 80ms to 1.1s chunk sizes
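
To get a feel for the latency range, it helps to translate chunk sizes into sample counts. The arithmetic below assumes 16 kHz mono audio, and the function names are hypothetical.

```python
def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = 16_000) -> int:
    """Number of audio samples the streaming model consumes per chunk."""
    return chunk_ms * sample_rate_hz // 1000


def chunks_needed(audio_ms: int, chunk_ms: int) -> int:
    """How many chunks (the last one possibly partial) cover a recording."""
    return -(-audio_ms // chunk_ms)  # ceiling division on integers


print(samples_per_chunk(80))       # → 1280
print(samples_per_chunk(1100))     # → 17600
print(chunks_needed(10_000, 80))   # → 125
print(chunks_needed(10_000, 1100)) # → 10
```

A 10-second recording is therefore processed in 125 small steps at the lowest-latency setting, or in 10 larger steps at the 1.1 s setting.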

[9]

Loading ASR model: nvidia/nemotron-speech-streaming-en-0.6b...
✅ ASR model loaded!

[10]


📝 Transcription: Hypothesis(score=-465.7001953125, y_sequence=tensor([112, 127,  41, 685, 342, 291,  32, 120, 143, 160, 358, 963,  54, 589,
        977]), text='Could you please tell me about robotics?', dec_out=None, dec_state=None, timestamp=[], alignments=None, frame_confidence=None, token_confidence=None, word_confidence=None, length=0, y=None, lm_state=None, lm_scores=None, ngram_lm_state=None, tokens=None, last_token=None, token_duration=None, last_frame=None)
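
The output above prints the whole `Hypothesis` object, while only its `text` field is needed downstream. A small, defensive helper (hypothetical name) can pull out the plain string whether the transcription call returns strings, objects with a `.text` attribute, or a list of either:

```python
def transcript_text(result) -> str:
    """Return plain text from a transcription result.

    Accepts a plain string, an object with a `.text` attribute (such as the
    Hypothesis printed above), or a list of either, taking the first item.
    """
    if isinstance(result, (list, tuple)):
        if not result:
            return ""
        result = result[0]
    if isinstance(result, str):
        return result.strip()
    return str(getattr(result, "text", "")).strip()
```

Applied to the result above, this would yield only the sentence "Could you please tell me about robotics?".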

Step 4: Enforce Safety with Nemotron Content Safety and PII Detection

Production agents need guardrails. The Llama-3.1-Nemotron-Safety-Guard-8B-v3 model checks both user inputs and agent outputs for safety violations.

┌─────────────────────────────────────────────────────────────┐
│                    SAFETY PIPELINE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐         ┌──────────────────┐                 │
│   │  User    │────────>│                  │                 │
│   │  Query   │         │   Safety Guard   │──> Safe/Unsafe  │
│   └──────────┘         │                  │                 │
│                        │  23 Categories:  │                 │
│   ┌──────────┐         │  - Violence      │                 │
│   │  Agent   │────────>│  - PII/Privacy   │                 │
│   │ Response │         │  - Harassment    │                 │
│   └──────────┘         │  - Fraud         │                 │
│                        │  - Malware       │                 │
│                        │  - etc.          │                 │
│                        └──────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key features:

Multilingual support (20+ languages)
PII detection (emails, SSNs, phone numbers)
Cultural context awareness
23 safety categories
Works with noisy ASR output

[11]

Loading safety model: nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3...

✅ Safety model loaded!

[12]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

🛡️ Safety Check Results:


1. Query: How does AI improve robotics?...
   Result: {'User Safety': 'safe', 'Response Safety': 'safe'}

2. Query: My email is john@example.com and my SSN is 123-45-...
   Result: {'User Safety': 'unsafe', 'Safety Categories': 'PII/Privacy'}
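
The result dictionaries above can be turned into a simple allow/deny decision. The sketch below assumes the guard model's raw output is a JSON object with the keys shown in those results; the function names are hypothetical, and anything that cannot be parsed is treated as unsafe so the check fails closed.

```python
import json
from typing import Dict, Tuple


def parse_safety_output(raw: str) -> Dict[str, str]:
    """Parse the guard model's JSON verdict; treat unparseable output as unsafe."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"User Safety": "unsafe", "Safety Categories": "unparseable output"}
    return parsed if isinstance(parsed, dict) else {"User Safety": "unsafe"}


def is_safe(result: Dict[str, str]) -> Tuple[bool, str]:
    """Return (safe?, categories). Any verdict other than 'safe' fails the check."""
    verdicts = [result[key] for key in ("User Safety", "Response Safety") if key in result]
    # An empty verdict list means the model gave no usable answer: fail closed.
    safe = bool(verdicts) and all(v.strip().lower() == "safe" for v in verdicts)
    return safe, result.get("Safety Categories", "")
```

With the two results shown above, the first query passes and the second is rejected with the category `PII/Privacy`.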

Step 5: Add Long-Context Reasoning with Nemotron 3 Nano

With retrieval, speech, and safety in place, we add the reasoning engine. Nemotron 3 Nano processes the retrieved context and generates intelligent responses.

┌─────────────────────────────────────────────────────────────┐
│                 REASONING PIPELINE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐                                              │
│   │ Retrieved│─────┐                                        │
│   │   Docs   │     │                                        │
│   └──────────┘     │    ┌────────────────┐                  │
│                    ├───>│                │                  │
│   ┌──────────┐     │    │  Nemotron 3    │    ┌──────────┐  │
│   │  Image   │─────┤    │     Nano       │───>│ Response │  │
│   │  Descs   │     │    │                │    └──────────┘  │
│   └──────────┘     │    │  1M tokens     │                  │
│                    │    │  Mamba+Trans   │                  │
│   ┌──────────┐     │    └────────────────┘                  │
│   │   User   │─────┘           │                            │
│   │  Query   │          ┌──────┴──────┐                     │
│   └──────────┘          │  Optional   │                     │
│                         │  Thinking   │                     │
│                         │    Mode     │                     │
│                         └─────────────┘                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Architecture highlights:

1M token context: Fit entire document collections in a single request
Mamba-Transformer hybrid: Efficient inference on long sequences
Thinking mode: Optional step-by-step reasoning for complex queries

For images in retrieved documents, we first use Nemotron Nano VL to describe them, then include those descriptions in the context for Nemotron 3 Nano.
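
One way to fold the image descriptions into the context is to attach each description directly beneath the document it belongs to. The prompt wording and the function name below are assumptions for illustration, not the notebook's actual template.

```python
from typing import List, Optional, Tuple


def build_prompt(query: str, docs: List[Tuple[str, Optional[str]]]) -> str:
    """Assemble retrieved text and image descriptions into one grounded prompt.

    Each doc is (text, image_description); the description is None for
    text-only documents.
    """
    blocks = []
    for number, (text, image_description) in enumerate(docs, start=1):
        block = f"[Document {number}]\n{text}"
        if image_description:
            block += f"\n[Image description]\n{image_description}"
        blocks.append(block)
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Keeping each description next to its source text lets the reasoning model tell which document an image belongs to.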

[13]

✅ Nemotron LLM initialized!

[14]

🤖 Response: 
Isaac Lab is used as a unified framework for robot learning that enables the development and testing of AI‑driven behaviors—such as locomotion, manipulation, and navigation—through modular components built on NVIDIA Isaac Sim.

Step 6: Build a Voice-Powered LangChain v1.0 Agent with RAG

Now we'll create a LangChain v1.0 agent using the modern create_agent API. The agent automatically loops, calling the RAG tool as needed until it can answer your voice query.

┌─────────────────────────────────────────────────────────────────────────────┐
│                   LANGCHAIN v1.0 AGENT PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   🎤 Voice Input (Microphone)                                               │
│       │                                                                     │
│       v                                                                     │
│   ASR (Nemotron) → Text Query                                               │
│       │                                                                     │
│       v                                                                     │
│   🛡️ STEP 2: Safety Check (Input) ─────────────────────> ❌ REJECT         │
│       │                                                                     │
│       v (if safe)                                                           │
│   ┌────────────────────────────────────────────────────────────────┐        │
│   │  STEP 3: create_agent Loop (CompiledStateGraph)                │        │
│   │                                                                │        │
│   │  Agent (ChatNVIDIA) ──> Decides: Need RAG?                     │        │
│   │     │                │                                         │        │
│   │     NO               YES                                       │        │
│   │     │                │                                         │        │
│   │     │                v                                         │        │
│   │     │           RAG Tool ───────────────────────┐              │        │
│   │     │           │                               │              │        │
│   │     │           v                               │              │        │
│   │     │       Has Images? ──YES──> VLM Describe   │              │        │
│   │     │           │                    │          │              │        │
│   │     │           NO                   v          │              │        │
│   │     │           │           Add to Context      │              │        │
│   │     │           └──────────────────────┬────────┘              │        │
│   │     │                                   │                      │        │
│   │     v                                   v                      │        │
│   │  Generate Response <────────────────────┘                      │        │
│   │                                                                │        │
│   │  (Loop continues until agent is satisfied)                     │        │
│   └────────────────────────────────────────────────────────────────┘        │
│       │                                                                     │
│       v                                                                     │
│   🛡️ STEP 4: Safety Check (Output) ─────────────────────> ❌ FILTER        │
│       │                                                                     │
│       v (if safe)                                                           │
│   ✅ STEP 5: Return Safe Response                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features:

Uses langchain.agents.create_agent (LangChain v1.0 API)
Returns CompiledStateGraph that auto-loops until complete
Integrates with ChatNVIDIA for NVIDIA API endpoints
Uses VLM when images are retrieved in context
Safety ALWAYS enforced on input AND output

First, define the RAG tool:

[15]

✅ RAG tool defined!
   Function: search_knowledge_base
   Capabilities: Multimodal retrieval + reranking + image description

[16]

✅ LangChain v1.0 Agent created!
   API: langchain.agents.create_agent (LangChain v1.0)
   Model: nvidia/nemotron-3-nano-30b-a3b via ChatNVIDIA
   Tools: search_knowledge_base
   Returns: CompiledStateGraph (auto-loops until complete)
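
The agent-creation cell is not shown in the export. A sketch consistent with the summary above could look like the following. The system prompt text and the `build_agent` wrapper are assumptions, and `search_fn` stands for whatever retrieval function the RAG tool wraps.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Call search_knowledge_base whenever you "
    "need facts, and answer only from the context it returns."
)  # hypothetical wording


def build_agent(search_fn):
    """Wrap a retrieval function as a tool and hand it to create_agent.

    search_fn is any callable taking a query string and returning context text.
    """
    # Imported lazily so this module loads even where LangChain is not installed.
    from langchain.agents import create_agent
    from langchain_core.tools import tool
    from langchain_nvidia_ai_endpoints import ChatNVIDIA

    @tool
    def search_knowledge_base(query: str) -> str:
        """Search the multimodal knowledge base and return relevant context."""
        return search_fn(query)

    llm = ChatNVIDIA(model="nvidia/nemotron-3-nano-30b-a3b")  # reads NVIDIA_API_KEY
    return create_agent(
        model=llm,
        tools=[search_knowledge_base],
        system_prompt=SYSTEM_PROMPT,
    )
```

The returned graph would then be called as `agent.invoke({"messages": [("user", query)]})`, with the final answer in the `content` of the last message of the result.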

[17]

✅ Voice RAG Agent initialized!

Pipeline Flow:
  1. 🎤 Voice Input → Nemotron ASR → Text
  2. 🛡️ Safety Check (input) → Reject if unsafe
  3. 🤖 Agent Loop (LangChain v1.0 create_agent)
     └─ RAG Tool → Embed + Rerank + VLM (if images)
  4. 🛡️ Safety Check (output) → Filter if unsafe
  5. ✅ Return safe response
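
The five-step flow above can be expressed as a short orchestration function. In this sketch every stage is passed in as a callable, so the control flow is visible without loading any model. All names and the two fallback messages are hypothetical.

```python
from typing import Callable

REJECTED = "Sorry, I can't help with that request."
FILTERED = "The response was withheld by the safety filter."


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    check_input: Callable[[str], bool],
    run_agent: Callable[[str], str],
    check_output: Callable[[str, str], bool],
) -> str:
    """Run the five steps in order, short-circuiting on a failed safety check."""
    query = transcribe(audio_path)          # 1. voice -> text
    if not check_input(query):              # 2. input safety
        return REJECTED
    answer = run_agent(query)               # 3. agent loop (RAG as needed)
    if not check_output(query, answer):     # 4. output safety
        return FILTERED
    return answer                           # 5. safe response
```

Because the input check runs before the agent is invoked, an unsafe query never reaches the LLM or the retrieval tool.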

Interactive Voice Interface

Now let's create an interactive microphone recorder to query the agent with your voice!

[ ]

======================================================================
📁 OPTION 1: Upload Audio File (Recommended for SSH/Remote)
======================================================================
Upload a .wav, .mp3, .flac, .webm, or .ogg file from your local machine.

FileUpload(value=(), accept='.wav,.mp3,.flac,.webm,.ogg,.m4a', description='Upload Audio', layout=Layout(width…


======================================================================
🎙️ OPTION 2: Browser Microphone (Local Jupyter Only)
======================================================================
⚠️  This requires localhost or HTTPS. Will show 'Permission denied' over SSH.


======================================================================
▶️  PROCESS AUDIO
======================================================================

Checkbox(value=False, description='Use sample audio (robotics.flac) for testing', layout=Layout(width='400px')…

Button(button_style='success', description='🎤 Process Audio', layout=Layout(height='50px', width='300px'), sty…

Output()

Summary

You've built a voice-powered LangChain v1.0 agent with RAG using NVIDIA Nemotron models:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     VOICE RAG AGENT STACK (LangChain v1.0)                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Component          Model                           Purpose                │
│   ─────────          ─────                           ───────                │
│   🎤 ASR             nemotron-speech-streaming       Voice → Text           │
│   📚 Embeddings      llama-nemotron-embed-vl         Semantic search        │
│   🔄 Reranking       llama-nemotron-rerank-vl        Sharpen accuracy       │
│   🖼️ Vision (VLM)    nemotron-nano-12b-vl            Describe images        │ 
│   🤖 Agent LLM       nemotron-3-nano-30b             Agent reasoning        │
│   🛡️ Safety          Llama-3.1-Nemotron-Safety       Input/Output checks    │
│                                                                             │
│   Architecture:      LangChain v1.0 Agent (CompiledStateGraph)              │
│   API:               langchain.agents.create_agent                          │
│   Model:             ChatNVIDIA (langchain_nvidia_ai_endpoints)             │
│   Tools:             - search_knowledge_base (RAG + VLM on-demand)          │
│   Input:             🎙️ Voice (microphone or audio file upload)             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

The 5-Step Pipeline

┌───────────────────────────────────────────────────────────────┐
│  STEP 1: Voice Input                                          │
│  🎤 Microphone → Nemotron ASR → Text Query                   │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 2: Input Safety Check                                   │
│  🛡️ Nemotron Safety Guard validates input                     │
│      ├── UNSAFE → ❌ Reject immediately                       │
│      └── SAFE   → ✅ Continue to agent                        │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 3: Agent Processing (create_agent loop)                 │
│  🤖 Agent decides: Need more information?                     │
│      ├── YES → Call RAG Tool                                  │
│      │         └── Retrieve + Rerank docs                     │
│      │         └── If images → VLM describes them             │
│      │         └── Return context to agent                    │
│      │         └── Loop back to decide again                  │
│      └── NO  → Generate final response                        │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 4: Output Safety Check                                  │
│  🛡️ Nemotron Safety Guard validates response                  │
│      ├── UNSAFE → ❌ Filter/block response                    │
│      └── SAFE   → ✅ Continue to output                       │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 5: Return Safe Response                                 │
│  ✅ Deliver grounded, safe response to user                   │
└───────────────────────────────────────────────────────────────┘

LangChain v1.0 Implementation Details

Modern create_agent API:

✅ Uses langchain.agents.create_agent (LangChain v1.0)
✅ Returns CompiledStateGraph that auto-loops
✅ Integrates with ChatNVIDIA for NVIDIA endpoints
✅ Simple message format: {"messages": [("user", query)]}

Voice-First Design:

🎤 HTML5 MediaRecorder for browser microphone access
🎙️ No webcam - audio only
📝 Automatic transcription with Nemotron ASR

Safety & Quality:

🛡️ Input safety check (before agent runs)
🛡️ Output safety check (after agent completes)
🖼️ VLM processes images when retrieved
🔍 Multimodal RAG (text + images)
🎯 Reranking for accuracy (+6-7%)

Next Steps

Add more tools: Web search, calculators, code execution
Enable streaming: Use agent.stream() for real-time responses
Add memory: Conversation history across multiple queries
Deploy: Use NVIDIA NIM for production inference
Custom data: Index your own documents and images