Building a Voice-Powered RAG Agent with NVIDIA Nemotron Models

This notebook walks you through building an end-to-end AI agent that combines voice input, multimodal retrieval, safety guardrails, and long-context reasoning using NVIDIA's Nemotron model family.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        VOICE-POWERED RAG AGENT                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│   │  Voice  │───>│   ASR   │───>│   RAG   │───>│   LLM   │───>│ Safety  │   │
│   │  Input  │    │ (NeMo)  │    │ Embed+  │    │ Reason  │    │  Guard  │   │
│   └─────────┘    └─────────┘    │ Rerank  │    └─────────┘    └─────────┘   │
│                                 └─────────┘                         │       │
│                                      │                              │       │
│                                      v                              v       │
│                                 ┌─────────┐                   ┌─────────┐   │
│                                 │  FAISS  │                   │  Safe   │   │
│                                 │  Index  │                   │ Response│   │
│                                 └─────────┘                   └─────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Models Used

Component          Model                                    Deployment
─────────          ─────                                    ──────────
Speech-to-Text     nemotron-speech-streaming-en-0.6b        Self-hosted (NeMo)
Embeddings         llama-nemotron-embed-vl-1b-v2            Self-hosted (Transformers)
Reranking          llama-nemotron-rerank-vl-1b-v2           Self-hosted (Transformers)
Vision-Language    nemotron-nano-12b-v2-vl                  NVIDIA API
Reasoning          nemotron-3-nano-30b-a3b                  NVIDIA API
Safety             Llama-3.1-Nemotron-Safety-Guard-8B-v3    Self-hosted (Transformers)

Prerequisites

  • NVIDIA GPU with 24GB+ VRAM (for self-hosted models)
  • NVIDIA API key (for cloud-hosted reasoning models)
  • Python 3.10+

Step 1: Environment Setup

Before we begin, we need to install the required dependencies and configure API access.

What gets installed:

  • LangChain v1.0: Modern agent orchestration with the create_agent API
  • langchain-nvidia-ai-endpoints: ChatNVIDIA integration for NVIDIA API
  • Transformers + PyTorch: For running local embedding, reranking, and safety models
  • FAISS: Vector similarity search
  • NeMo Toolkit: NVIDIA's ASR framework
  • ipywebrtc: Audio recording widget for Jupyter

# Install core dependencies (LangChain v1.0+)
pip install langchain langchain-nvidia-ai-endpoints faiss-cpu transformers torch pillow requests jinja2 soundfile

# Install NeMo for ASR (quoted so shells such as zsh do not expand the brackets)
pip install "nemo_toolkit[asr]"

# Install audio recording widget
pip install ipywebrtc
jupyter nbextension enable --py widgetsnbextension
[1]
✅ Environment configured successfully!
[2]
✅ GPU available: NVIDIA RTX 6000 Ada Generation
   Memory: 47.6 GB
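
The setup cells themselves are not shown above, so here is a minimal sketch of the kind of check they might perform. It assumes the API key is stored in the NVIDIA_API_KEY environment variable (the variable that langchain-nvidia-ai-endpoints reads by default); the function names check_environment and describe_gpu are hypothetical and only for illustration.

```python
import os
import sys


def check_environment(env=None, min_python=(3, 10)):
    """Return a list of problems; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    # Assumption: the NVIDIA API key is provided via NVIDIA_API_KEY.
    if not env.get("NVIDIA_API_KEY"):
        problems.append("NVIDIA_API_KEY is not set")
    return problems


def describe_gpu():
    """Describe the first CUDA device, or return None if PyTorch/CUDA is missing."""
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    return f"{props.name} ({props.total_memory / 1024**3:.1f} GiB)"
```

Calling check_environment() with no arguments inspects the real process environment; passing a plain dict makes the check easy to test.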

Step 2: Ground the Agent with Multimodal RAG

Retrieval-Augmented Generation (RAG) grounds our agent in real data, reducing hallucinations by providing factual context for every response.

┌─────────────────────────────────────────────────────────────────┐
│                    MULTIMODAL RAG PIPELINE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   INDEXING (Offline)                                            │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │  Text    │───>│  Embed   │───>│  FAISS   │                  │
│   │  Docs    │    │  Model   │    │  Index   │                  │
│   └──────────┘    └──────────┘    └──────────┘                  │
│   ┌──────────┐         │                                        │
│   │  Images  │─────────┘                                        │
│   └──────────┘                                                  │
│                                                                 │
│   RETRIEVAL (Online)                                            │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│   │  Query   │───>│  Embed   │───>│  Search  │───>│  Rerank  │  │
│   │          │    │  Query   │    │  Top-K   │    │  Top-N   │  │
│   └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2.1 Load the Embedding Model

The llama-nemotron-embed-vl-1b-v2 model creates semantic vector representations of both text and images. It supports three input modes:

  • Text-only embedding: Standard document search
  • Image-only embedding: Search over screenshots, diagrams, slides
  • Image+Text pairs: Maximum retrieval accuracy for rich documents

The model uses different context lengths for each mode to optimize quality.

[3]
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
✅ Logged into HuggingFace!
Loading embedding model: nvidia/llama-nemotron-embed-vl-1b-v2...
✅ Embedding model loaded!

2.2 Load the Reranking Model

Initial retrieval casts a wide net using fast vector similarity. The reranker then performs deeper query-document interaction to surface the most relevant results.

Why rerank? Embedding-based retrieval is fast but approximate. The reranker reads each candidate document alongside the query, enabling cross-attention between them. This improves accuracy by ~6-7% on benchmarks.

The llama-nemotron-rerank-vl-1b-v2 model handles both text and image documents, using the same multimodal architecture as the embedding model.

[4]
Loading reranking model: nvidia/llama-nemotron-rerank-vl-1b-v2...
✅ Reranking model loaded!
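
To make the reranking step concrete, here is a minimal sketch of the second stage in isolation. It is not the notebook's implementation: the scorer is passed in as a plain function, and toy_overlap_score is a hypothetical word-overlap stand-in for the neural reranker so the example runs without a GPU.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 2) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the top_n best documents."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also occur in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = ["robots learn in simulation", "genomics research", "robots and AI"]
top = rerank("AI robots", candidates, toy_overlap_score, top_n=2)
```

In a real pipeline, score_fn would wrap a call to the reranking model, which reads the query and each candidate together instead of comparing precomputed vectors.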

2.3 Build a Sample Knowledge Base

Let's create a small knowledge base with both text and images to demonstrate multimodal retrieval. In production, you would index your actual documents, PDFs, and images here.

Sample topics:

  • NVIDIA Isaac Lab (robotics)
  • Autonomous vehicles (NVIDIA DRIVE)
  • Nemotron 3 Nano architecture
  • Genomics research (Evo-2)
  • RAG fundamentals
[5]
✅ Loaded 5 documents
   - With images: 3
   - Text only: 2
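
One way to represent such a mixed knowledge base is a small record type that carries optional text and an optional image. The sketch below is an assumption about structure, not the notebook's actual code; the Document class, its field names, and the file paths are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    text: Optional[str] = None
    image_path: Optional[str] = None  # hypothetical path to an image on disk

    @property
    def mode(self) -> str:
        """Pick the embedding mode: text-only, image-only, or image+text."""
        if self.text and self.image_path:
            return "image+text"
        if self.image_path:
            return "image"
        if self.text:
            return "text"
        raise ValueError("a document needs text, an image, or both")


def summarize(docs: List[Document]) -> dict:
    """Count documents with and without an attached image."""
    with_images = sum(1 for d in docs if d.image_path)
    return {"total": len(docs),
            "with_images": with_images,
            "text_only": len(docs) - with_images}


docs = [
    Document("Isaac Lab overview", "images/isaac_lab.png"),
    Document("NVIDIA DRIVE overview", "images/drive.png"),
    Document("Nemotron 3 Nano architecture", "images/nemotron.png"),
    Document("Evo-2 genomics research"),
    Document("RAG fundamentals"),
]
```

The mode property mirrors the three embedding modes described in section 2.1, so the indexing code can dispatch on it.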
[6]
Creating document embeddings...
✅ Created embeddings with shape: (5, 2048)
[7]
✅ FAISS index created with 5 vectors
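
The index-building cell is not shown, so the following is a dependency-free sketch of what an exact (flat) inner-product index computes: every stored vector is compared with the query, and the k highest scores win. It assumes vectors are L2-normalized so that the inner product equals cosine similarity; whether the notebook uses an inner-product or an L2 index is not stated, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(index_vectors: Sequence[Sequence[float]],
           query: Sequence[float],
           k: int = 2) -> List[Tuple[int, float]]:
    """Brute-force top-k search; returns (vector index, cosine score) pairs."""
    q = l2_normalize(query)
    scored = []
    for i, vec in enumerate(index_vectors):
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), i))
    scored.sort(reverse=True)  # highest score first
    return [(i, score) for score, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = search(vectors, [1.0, 0.1], k=2)
```

With only five 2048-dimensional vectors, brute force is instant; a library such as FAISS matters once the collection grows to many thousands of documents.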
[8]
Query: How is AI used in robotics?

Top 2 results:
  1. NVIDIA Isaac Lab is a unified framework for robot learning built on Isaac Sim. It provides modular c...
     Has image: True
  2. Autonomous vehicles use AI for perception, planning, and control. NVIDIA DRIVE provides the compute ...
     Has image: True

Step 3: Add Real-Time Speech with Nemotron Speech ASR

Now we add voice input capability. The nemotron-speech-streaming-en-0.6b model converts spoken audio to text with ultra-low latency.

┌─────────────────────────────────────────────────────────────┐
│                    ASR PIPELINE                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│   │  Audio   │───>│  NeMo    │───>│  Text    │              │
│   │  Stream  │    │  ASR     │    │  Output  │              │
│   └──────────┘    └──────────┘    └──────────┘              │
│                        │                                    │
│                        v                                    │
│              ┌─────────────────┐                            │
│              │ + Punctuation   │                            │
│              │ + Capitalization│                            │
│              └─────────────────┘                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key features:

  • Trained on the 285k-hour Granary dataset
  • 7.16% average WER on Open ASR Leaderboard
  • Cache-aware streaming for real-time applications
  • Built-in punctuation and capitalization
  • Configurable latency: 80ms to 1.1s chunk sizes
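
The latency setting translates directly into how many audio samples are fed to the model at a time. The sketch below shows that arithmetic, assuming 16 kHz mono audio; the helper names are hypothetical and the chunking here is a plain split, not NeMo's cache-aware streaming logic.

```python
SAMPLE_RATE = 16_000  # Hz; assumed input sample rate


def chunk_samples(chunk_ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one chunk of the given duration."""
    return round(sample_rate * chunk_ms / 1000)


def split_into_chunks(samples, chunk_ms: float, sample_rate: int = SAMPLE_RATE):
    """Split a sample sequence into consecutive chunks; the last may be shorter."""
    size = chunk_samples(chunk_ms, sample_rate)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


one_second = list(range(SAMPLE_RATE))        # dummy audio, one second long
chunks = split_into_chunks(one_second, 80)   # 80 ms chunks
```

At 16 kHz, an 80 ms chunk holds 1,280 samples and a 1.1 s chunk holds 17,600, so smaller chunks mean more frequent, smaller model calls and lower latency.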
[9]
[NeMo W 2026-01-06 00:00:29 nemo_logging:405] Megatron num_microbatches_calculator not found, using Apex version.
OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
No exporters were provided. This means that no telemetry data will be collected.
[NeMo W 2026-01-06 00:00:30 nemo_logging:405] /home/chris/projects/use-case-examples/nemotron-voice-rag-agent-example/.venv/lib/python3.12/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
      warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
    
Loading ASR model: nvidia/nemotron-speech-streaming-en-0.6b...
[NeMo I 2026-01-06 00:00:32 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2026-01-06 00:00:32 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    use_lhotse: true
    skip_missing_manifest_entries: true
    input_cfg: /lustre/fs12/portfolios/llmservice/projects/llmservice_nemo_speechlm/users/weiqingw/manifests/input_cfg/am-os_fl_ll_mc_mm_mo_no_su_yo_yt_gsc_en.yaml
    tarred_audio_filepaths: null
    manifest_filepath: null
    sample_rate: 16000
    shuffle: true
    num_workers: 2
    pin_memory: true
    max_duration: 40.0
    min_duration: 0.1
    text_field: answer
    batch_duration: null
    use_bucketing: true
    max_tps:
    - 10.92
    - 11.16
    - 10.68
    - 10.22
    - 9.98
    - 9.67
    - 9.5
    - 9.36
    - 9.04
    - 9.38
    - 8.81
    - 8.78
    - 8.24
    - 8.85
    - 9.25
    bucket_duration_bins:
    - - 5.76
      - 62
    - - 7.12
      - 77
    - - 8.32
      - 83
    - - 9.44
      - 92
    - - 10.5
      - 103
    - - 11.68
      - 111
    - - 12.88
      - 117
    - - 14.08
      - 130
    - - 15.44
      - 138
    - - 17.2
      - 156
    - - 19.36
      - 158
    - - 22.4
      - 189
    - - 26.64
      - 217
    - - 32.8
      - 272
    - - 40.1
      - 352
    bucket_batch_size:
    - 100
    - 100
    - 80
    - 80
    - 50
    - 50
    - 50
    - 50
    - 40
    - 30
    - 20
    - 20
    - 15
    - 10
    - 3
    num_buckets: 15
    bucket_buffer_size: 7500
    shuffle_buffer_size: 5000
    
[NeMo W 2026-01-06 00:00:32 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    use_lhotse: true
    manifest_filepath:
    - /lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_speechlm/data/canary/canary_v0/manifests/data/ASR/MMLPC/en/val_test/mcv11/mcv11_dev_clean_pcstrip_en_2k.json
    sample_rate: 16000
    batch_size: 4
    shuffle: false
    max_duration: 40.0
    min_duration: 0.1
    num_workers: 2
    pin_memory: true
    text_field: answer
    
[NeMo I 2026-01-06 00:00:32 nemo_logging:393] PADDING: 0
[NeMo I 2026-01-06 00:00:34 nemo_logging:393] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.005, 'clamp': -1.0}
[NeMo I 2026-01-06 00:00:34 nemo_logging:393] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.005, 'clamp': -1.0}
[NeMo I 2026-01-06 00:00:34 nemo_logging:393] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.005, 'clamp': -1.0}
[NeMo I 2026-01-06 00:00:35 nemo_logging:393] Model EncDecRNNTBPEModel was successfully restored from /home/chris/.cache/huggingface/hub/models--nvidia--nemotron-speech-streaming-en-0.6b/snapshots/e730059607cecd9cccf501d8a39f5d22f0993db8/nemotron-speech-streaming-en-0.6b.nemo.
  → Disabled CUDA graphs on decoding_computer
✅ ASR model loaded!
[10]
[NeMo W 2026-01-06 00:00:38 nemo_logging:405] The following configuration keys are ignored by Lhotse dataloader: use_start_end_token
[NeMo W 2026-01-06 00:00:38 nemo_logging:405] You are using a non-tarred dataset and requested tokenization during data sampling (pretokenize=True). This will cause the tokenization to happen in the main (GPU) process,possibly impacting the training speed if your tokenizer is very large.If the impact is noticable, set pretokenize=False in dataloader config.(note: that will disable token-per-second filtering and 2D bucketing features)
Transcribing: 1it [00:00, 12.34it/s]

📝 Transcription: Hypothesis(score=-465.7001953125, y_sequence=tensor([112, 127,  41, 685, 342, 291,  32, 120, 143, 160, 358, 963,  54, 589,
        977]), text='Could you please tell me about robotics?', dec_out=None, dec_state=None, timestamp=[], alignments=None, frame_confidence=None, token_confidence=None, word_confidence=None, length=0, y=None, lm_state=None, lm_scores=None, ngram_lm_state=None, tokens=None, last_token=None, token_duration=None, last_frame=None)
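
The output above prints the whole Hypothesis object, while the pipeline only needs the string in its text field. A small helper like the one below can normalize the result; extract_text is a hypothetical name, and the sketch assumes the transcription result is either a plain string or an object with a text attribute, as the printed output suggests.

```python
def extract_text(result) -> str:
    """Return the transcript string from a transcription result."""
    if isinstance(result, str):
        return result
    text = getattr(result, "text", None)
    if text is None:
        raise TypeError(f"cannot extract text from {type(result).__name__}")
    return text


class FakeHypothesis:
    """Stand-in for the ASR result object, used only for this demonstration."""

    def __init__(self, text):
        self.text = text


transcript = extract_text(FakeHypothesis("Could you please tell me about robotics?"))
```

Passing the first element of the transcription result through a helper like this gives the downstream safety check and agent a clean string.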


Step 4: Enforce Safety with Nemotron Content Safety and PII Detection

Production agents need guardrails. The Llama-3.1-Nemotron-Safety-Guard-8B-v3 model checks both user inputs and agent outputs for safety violations.

┌─────────────────────────────────────────────────────────────┐
│                    SAFETY PIPELINE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐         ┌──────────────────┐                 │
│   │  User    │────────>│                  │                 │
│   │  Query   │         │   Safety Guard   │──> Safe/Unsafe  │
│   └──────────┘         │                  │                 │
│                        │  23 Categories:  │                 │
│   ┌──────────┐         │  - Violence      │                 │
│   │  Agent   │────────>│  - PII/Privacy   │                 │
│   │ Response │         │  - Harassment    │                 │
│   └──────────┘         │  - Fraud         │                 │
│                        │  - Malware       │                 │
│                        │  - etc.          │                 │
│                        └──────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key features:

  • Multilingual support (20+ languages)
  • PII detection (emails, SSNs, phone numbers)
  • Cultural context awareness
  • 23 safety categories
  • Works with noisy ASR output
[11]
Loading safety model: nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3...
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
✅ Safety model loaded!
[12]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
🛡️ Safety Check Results:
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

1. Query: How does AI improve robotics?...
   Result: {'User Safety': 'safe', 'Response Safety': 'safe'}

2. Query: My email is john@example.com and my SSN is 123-45-...
   Result: {'User Safety': 'unsafe', 'Safety Categories': 'PII/Privacy'}
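
The result dictionaries above can be turned into a simple allow/block decision. The sketch below assumes the keys shown in the output (User Safety, Response Safety, Safety Categories), that the model's raw output may arrive as a JSON string, and that multiple categories are comma-separated; the function names are hypothetical. It fails closed: a result with no verdict at all is treated as unsafe.

```python
import json


def parse_safety_output(raw):
    """Accept either an already-parsed dict or a JSON string."""
    return raw if isinstance(raw, dict) else json.loads(raw)


def is_safe(result: dict) -> bool:
    """True only if every verdict present is 'safe' (and at least one exists)."""
    verdicts = [result[key] for key in ("User Safety", "Response Safety")
                if key in result]
    return bool(verdicts) and all(v.strip().lower() == "safe" for v in verdicts)


def violated_categories(result: dict) -> list:
    """Split the category string into a list; empty if nothing was flagged."""
    categories = result.get("Safety Categories", "")
    return [c.strip() for c in categories.split(",") if c.strip()]


ok = {"User Safety": "safe", "Response Safety": "safe"}
flagged = {"User Safety": "unsafe", "Safety Categories": "PII/Privacy"}
```

Failing closed is a deliberate choice here: if the guard model returns something unexpected, the request is blocked instead of silently passing through.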

Step 5: Add Long-Context Reasoning with Nemotron 3 Nano

With retrieval, speech, and safety in place, we add the reasoning engine. Nemotron 3 Nano processes the retrieved context and generates intelligent responses.

┌─────────────────────────────────────────────────────────────┐
│                 REASONING PIPELINE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐                                              │
│   │ Retrieved│─────┐                                        │
│   │   Docs   │     │                                        │
│   └──────────┘     │    ┌────────────────┐                  │
│                    ├───>│                │                  │
│   ┌──────────┐     │    │  Nemotron 3    │    ┌──────────┐  │
│   │  Image   │─────┤    │     Nano       │───>│ Response │  │
│   │  Descs   │     │    │                │    └──────────┘  │
│   └──────────┘     │    │  1M tokens     │                  │
│                    │    │  Mamba+Trans   │                  │
│   ┌──────────┐     │    └────────────────┘                  │
│   │   User   │─────┘           │                            │
│   │  Query   │          ┌──────┴──────┐                     │
│   └──────────┘          │  Optional   │                     │
│                         │  Thinking   │                     │
│                         │    Mode     │                     │
│                         └─────────────┘                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Architecture highlights:

  • 1M token context: Fit entire document collections in a single request
  • Mamba-Transformer hybrid: Efficient inference on long sequences
  • Thinking mode: Optional step-by-step reasoning for complex queries

For images in retrieved documents, we first use Nemotron Nano VL to describe them, then include those descriptions in the context for Nemotron 3 Nano.
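
As a rough illustration of that hand-off, the sketch below assembles retrieved text and image descriptions into a single prompt string. The prompt wording, the build_context name, and the convention of keying descriptions by document position are all assumptions for illustration, not the notebook's actual template.

```python
from typing import Dict, List, Optional


def build_context(query: str,
                  docs: List[str],
                  image_descriptions: Optional[Dict[int, str]] = None) -> str:
    """Combine retrieved documents, image descriptions, and the query."""
    image_descriptions = image_descriptions or {}
    blocks = []
    for position, text in enumerate(docs):
        block = f"[Document {position + 1}]\n{text}"
        description = image_descriptions.get(position)
        if description:
            # Description produced by the vision-language model for this doc.
            block += f"\n[Image description] {description}"
        blocks.append(block)
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {query}"
    )


prompt = build_context(
    "How is AI used in robotics?",
    ["Isaac Lab is a framework for robot learning.", "RAG grounds answers in data."],
    {0: "A robot arm grasping a cube in simulation."},
)
```

Because the reasoning model only sees text, any information in an image reaches it solely through the description, so the quality of that description bounds the quality of the answer.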

[13]
✅ Nemotron LLM initialized!
[14]
🤖 Response: 
Isaac Lab is used as a unified framework for robot learning that enables the development and testing of AI‑driven behaviors—such as locomotion, manipulation, and navigation—through modular components built on NVIDIA Isaac Sim.

Step 6: Build a Voice-Powered LangChain v1.0 Agent with RAG

Now we'll create a LangChain v1.0 agent using the modern create_agent API. The agent automatically loops, calling the RAG tool as needed until it can answer your voice query.

┌─────────────────────────────────────────────────────────────────────────────┐
│                   LANGCHAIN v1.0 AGENT PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   🎤 Voice Input (Microphone)                                               │
│       │                                                                     │
│       v                                                                     │
│   ASR (Nemotron) → Text Query                                               │
│       │                                                                     │
│       v                                                                     │
│   🛡️ STEP 2: Safety Check (Input) ─────────────────────> ❌ REJECT         │
│       │                                                                     │
│       v (if safe)                                                           │
│   ┌────────────────────────────────────────────────────────────────┐        │
│   │  STEP 3: create_agent Loop (CompiledStateGraph)                │        │
│   │                                                                │        │
│   │  Agent (ChatNVIDIA) ──> Decides: Need RAG?                     │        │
│   │     │                │                                         │        │
│   │     NO               YES                                       │        │
│   │     │                │                                         │        │
│   │     │                v                                         │        │
│   │     │           RAG Tool ───────────────────────┐              │        │
│   │     │           │                               │              │        │
│   │     │           v                               │              │        │
│   │     │       Has Images? ──YES──> VLM Describe   │              │        │
│   │     │           │                    │          │              │        │
│   │     │           NO                   v          │              │        │
│   │     │           │           Add to Context      │              │        │
│   │     │           └──────────────────────┬────────┘              │        │
│   │     │                                   │                      │        │
│   │     v                                   v                      │        │
│   │  Generate Response <────────────────────┘                      │        │
│   │                                                                │        │
│   │  (Loop continues until agent is satisfied)                     │        │
│   └────────────────────────────────────────────────────────────────┘        │
│       │                                                                     │
│       v                                                                     │
│   🛡️ STEP 4: Safety Check (Output) ─────────────────────> ❌ FILTER        │
│       │                                                                     │
│       v (if safe)                                                           │
│   ✅ STEP 5: Return Safe Response                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features:

  • Uses langchain.agents.create_agent (LangChain v1.0 API)
  • Returns CompiledStateGraph that auto-loops until complete
  • Integrates with ChatNVIDIA for NVIDIA API endpoints
  • Uses VLM when images are retrieved in context
  • Safety ALWAYS enforced on input AND output

First, define the RAG tool:

[15]
✅ RAG tool defined!
   Function: search_knowledge_base
   Capabilities: Multimodal retrieval + reranking + image description
[16]
✅ LangChain v1.0 Agent created!
   API: langchain.agents.create_agent (LangChain v1.0)
   Model: nvidia/nemotron-3-nano-30b-a3b via ChatNVIDIA
   Tools: search_knowledge_base
   Returns: CompiledStateGraph (auto-loops until complete)
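
The agent-creation cell is not shown, so here is a hedged sketch of how such an agent could be wired up. It relies on langchain.agents.create_agent and ChatNVIDIA as named above; the system prompt text and the helper names build_agent, make_input, and final_answer are assumptions. The heavy imports happen inside build_agent so the helpers can be used without those packages installed.

```python
def build_agent(tools, model_name="nvidia/nemotron-3-nano-30b-a3b"):
    """Create the agent graph; requires langchain and langchain-nvidia-ai-endpoints."""
    from langchain.agents import create_agent
    from langchain_nvidia_ai_endpoints import ChatNVIDIA

    # ChatNVIDIA picks up the API key from the NVIDIA_API_KEY environment variable.
    llm = ChatNVIDIA(model=model_name)
    return create_agent(
        model=llm,
        tools=tools,
        # Illustrative prompt, not the notebook's wording.
        system_prompt="Use the search tool when you need facts, then answer concisely.",
    )


def make_input(query: str) -> dict:
    """Build the message payload in the simple tuple format."""
    return {"messages": [("user", query)]}


def final_answer(result: dict):
    """Return the content of the last message in the agent's final state."""
    last = result["messages"][-1]
    return getattr(last, "content", last)
```

A typical call would be final_answer(agent.invoke(make_input("How is AI used in robotics?"))), where agent is the object returned by build_agent.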
[17]
✅ Voice RAG Agent initialized!

Pipeline Flow:
  1. 🎤 Voice Input → Nemotron ASR → Text
  2. 🛡️ Safety Check (input) → Reject if unsafe
  3. 🤖 Agent Loop (LangChain v1.0 create_agent)
     └─ RAG Tool → Embed + Rerank + VLM (if images)
  4. 🛡️ Safety Check (output) → Filter if unsafe
  5. ✅ Return safe response
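
The five-step flow can be sketched as a single function that takes each stage as a callable. This is an illustrative skeleton, not the notebook's class: run_voice_pipeline, the stage parameter names, and the rejection messages are all hypothetical, and the stages are injected so the control flow can be exercised with simple stubs.

```python
from typing import Callable


def run_voice_pipeline(audio,
                       transcribe: Callable,
                       check_safety: Callable[[str], bool],
                       run_agent: Callable[[str], str],
                       rejection: str = "Sorry, I can't help with that request.",
                       filtered: str = "The response was withheld by the safety filter.") -> dict:
    """Run ASR, input safety, the agent, and output safety in order."""
    query = transcribe(audio)                      # step 1: voice to text
    if not check_safety(query):                    # step 2: input check
        return {"query": query, "response": rejection, "status": "rejected"}
    answer = run_agent(query)                      # step 3: agent loop with RAG
    if not check_safety(answer):                   # step 4: output check
        return {"query": query, "response": filtered, "status": "filtered"}
    return {"query": query, "response": answer, "status": "ok"}  # step 5


# Stub stages for a dry run without any models loaded.
calls = []
result = run_voice_pipeline(
    b"fake-audio",
    transcribe=lambda audio: "tell me about robotics",
    check_safety=lambda text: "ssn" not in text.lower(),
    run_agent=lambda q: calls.append(q) or "Robots learn in simulation.",
)
```

Returning early on an unsafe input means the agent and its tools never see the rejected query, which matches the reject-before-agent ordering in the flow above.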

Interactive Voice Interface

Now let's create an interactive microphone recorder to query the agent with your voice!

[ ]
======================================================================
📁 OPTION 1: Upload Audio File (Recommended for SSH/Remote)
======================================================================
Upload a .wav, .mp3, .flac, .webm, or .ogg file from your local machine.

FileUpload(value=(), accept='.wav,.mp3,.flac,.webm,.ogg,.m4a', description='Upload Audio', layout=Layout(width…

======================================================================
🎙️ OPTION 2: Browser Microphone (Local Jupyter Only)
======================================================================
⚠️  This requires localhost or HTTPS. Will show 'Permission denied' over SSH.


======================================================================
▶️  PROCESS AUDIO
======================================================================
Checkbox(value=False, description='Use sample audio (robotics.flac) for testing', layout=Layout(width='400px')…
Button(button_style='success', description='🎤 Process Audio', layout=Layout(height='50px', width='300px'), sty…
Output()

Summary

You've built a voice-powered LangChain v1.0 agent with RAG using NVIDIA Nemotron models:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     VOICE RAG AGENT STACK (LangChain v1.0)                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Component          Model                           Purpose                │
│   ─────────          ─────                           ───────                │
│   🎤 ASR             nemotron-speech-streaming       Voice → Text           │
│   📚 Embeddings      llama-nemotron-embed-vl         Semantic search        │
│   🔄 Reranking       llama-nemotron-rerank-vl        Sharpen accuracy       │
│   🖼️ Vision (VLM)    nemotron-nano-12b-vl            Describe images        │
│   🤖 Agent LLM       nemotron-3-nano-30b             Agent reasoning        │
│   🛡️ Safety          Llama-3.1-Nemotron-Safety       Input/Output checks    │
│                                                                             │
│   Architecture:      LangChain v1.0 Agent (CompiledStateGraph)              │
│   API:               langchain.agents.create_agent                          │
│   Model:             ChatNVIDIA (langchain_nvidia_ai_endpoints)             │
│   Tools:             - search_knowledge_base (RAG + VLM on-demand)          │
│   Input:             🎙️ Voice only (microphone)                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

The 5-Step Pipeline

┌───────────────────────────────────────────────────────────────┐
│  STEP 1: Voice Input                                          │
│  🎤 Microphone → Nemotron ASR → Text Query                   │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 2: Input Safety Check                                   │
│  🛡️ Nemotron Safety Guard validates input                     │
│      ├── UNSAFE → ❌ Reject immediately                       │
│      └── SAFE   → ✅ Continue to agent                        │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 3: Agent Processing (create_agent loop)                 │
│  🤖 Agent decides: Need more information?                     │
│      ├── YES → Call RAG Tool                                  │
│      │         └── Retrieve + Rerank docs                     │
│      │         └── If images → VLM describes them             │
│      │         └── Return context to agent                    │
│      │         └── Loop back to decide again                  │
│      └── NO  → Generate final response                        │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 4: Output Safety Check                                  │
│  🛡️ Nemotron Safety Guard validates response                  │
│      ├── UNSAFE → ❌ Filter/block response                    │
│      └── SAFE   → ✅ Continue to output                       │
└───────────────────────────────────────────────────────────────┘
                              │
                              v
┌───────────────────────────────────────────────────────────────┐
│  STEP 5: Return Safe Response                                 │
│  ✅ Deliver grounded, safe response to user                   │
└───────────────────────────────────────────────────────────────┘

LangChain v1.0 Implementation Details

Modern create_agent API:

  • ✅ Uses langchain.agents.create_agent (LangChain v1.0)
  • ✅ Returns CompiledStateGraph that auto-loops
  • ✅ Integrates with ChatNVIDIA for NVIDIA endpoints
  • ✅ Simple message format: {"messages": [("user", query)]}

Voice-First Design:

  • 🎤 HTML5 MediaRecorder for browser microphone access
  • 🎙️ No webcam - audio only
  • 📝 Automatic transcription with Nemotron ASR

Safety & Quality:

  • 🛡️ Input safety check (before agent runs)
  • 🛡️ Output safety check (after agent completes)
  • 🖼️ VLM processes images when retrieved
  • 🔍 Multimodal RAG (text + images)
  • 🎯 Reranking for accuracy (+6-7%)

Next Steps

  • Add more tools: Web search, calculators, code execution
  • Enable streaming: Use agent.stream() for real-time responses
  • Add memory: Conversation history across multiple queries
  • Deploy: Use NVIDIA NIM for production inference
  • Custom data: Index your own documents and images
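
As one possible starting point for the memory item, the sketch below keeps a bounded history of past turns and replays it in the same tuple-based message format the agent accepts. ConversationMemory and its methods are hypothetical names; LangChain also offers its own persistence mechanisms, which this simple class does not use.

```python
from collections import deque


class ConversationMemory:
    """Keep the most recent turns and replay them ahead of each new query."""

    def __init__(self, max_turns: int = 5):
        # Oldest turns are dropped automatically once max_turns is exceeded.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def as_messages(self, new_query: str) -> dict:
        """Build an agent input containing the history plus the new query."""
        messages = []
        for user, assistant in self._turns:
            messages.append(("user", user))
            messages.append(("assistant", assistant))
        messages.append(("user", new_query))
        return {"messages": messages}


memory = ConversationMemory(max_turns=1)
memory.add_turn("What is Isaac Lab?", "A framework for robot learning.")
memory.add_turn("Who makes it?", "NVIDIA.")
payload = memory.as_messages("Is it open source?")
```

Bounding the history keeps prompts short for voice interactions; with a long-context model the limit can be raised, at the cost of more tokens per request.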

Resources