Voice RAG Agent Tutorial
Building a Voice-Powered RAG Agent with NVIDIA Nemotron Models
This notebook walks you through building an end-to-end AI agent that combines voice input, multimodal retrieval, safety guardrails, and long-context reasoning using NVIDIA's Nemotron model family.
┌─────────────────────────────────────────────────────────────────────────────┐
│ VOICE-POWERED RAG AGENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Voice │───>│ ASR │───>│ RAG │───>│ LLM │───>│ Safety │ │
│ │ Input │ │ (NeMo) │ │ Embed+ │ │ Reason │ │ Guard │ │
│ └─────────┘ └─────────┘ │ Rerank │ └─────────┘ └─────────┘ │
│ └─────────┘ │ │
│ │ │ │
│ v v │
│ ┌─────────┐ ┌─────────┐ │
│ │ FAISS │ │ Safe │ │
│ │ Index │ │ Response│ │
│ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Models Used
| Component | Model | Deployment |
|---|---|---|
| Speech-to-Text | nemotron-speech-streaming-en-0.6b | Self-hosted (NeMo) |
| Embeddings | llama-nemotron-embed-vl-1b-v2 | Self-hosted (Transformers) |
| Reranking | llama-nemotron-rerank-vl-1b-v2 | Self-hosted (Transformers) |
| Vision-Language | nemotron-nano-12b-v2-vl | NVIDIA API |
| Reasoning | nemotron-3-nano-30b-a3b | NVIDIA API |
| Safety | Llama-3.1-Nemotron-Safety-Guard-8B-v3 | Self-hosted (Transformers) |
Prerequisites
- NVIDIA GPU with 24GB+ VRAM (for self-hosted models)
- NVIDIA API key (for cloud-hosted reasoning models)
- Python 3.10+
Step 1: Environment Setup
Before we begin, we need to install the required dependencies and configure API access.
What gets installed:
- LangChain v1.0: Modern agent orchestration with the create_agent API
- langchain-nvidia-ai-endpoints: ChatNVIDIA integration for NVIDIA API
- Transformers + PyTorch: For running local embedding, reranking, and safety models
- FAISS: Vector similarity search
- NeMo Toolkit: NVIDIA's ASR framework
- ipywebrtc: Audio recording widget for Jupyter
# Install core dependencies (LangChain v1.0+)
pip install langchain langchain-nvidia-ai-endpoints faiss-cpu transformers torch pillow requests jinja2 soundfile
# Install NeMo for ASR
pip install "nemo_toolkit[asr]"
# Install audio recording widget
pip install ipywebrtc
jupyter nbextension enable --py widgetsnbextension
✅ Environment configured successfully!
✅ GPU available: NVIDIA RTX 6000 Ada Generation
Memory: 47.6 GB
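The configuration check can be sketched as a small pre-flight function. This is a minimal sketch, not the notebook's actual setup cell: `missing_settings` is a hypothetical helper, and the variable names `NVIDIA_API_KEY` and `HF_TOKEN` are assumptions based on what the NVIDIA LangChain integration and the Hugging Face Hub client conventionally read.

```python
import os
import sys

# Assumed variable names; adjust them if your setup uses different ones.
REQUIRED_ENV_VARS = ("NVIDIA_API_KEY", "HF_TOKEN")
MIN_PYTHON = (3, 10)  # from the prerequisites list


def missing_settings(env, python_version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    for name in REQUIRED_ENV_VARS:
        # Treat empty or whitespace-only values as missing.
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")
    return problems


issues = missing_settings(os.environ)
print("Environment OK" if not issues else "\n".join(issues))
```

Running the check before loading any model avoids discovering a missing key only after the large downloads have finished.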
Step 2: Ground the Agent with Multimodal RAG
Retrieval-Augmented Generation (RAG) grounds our agent in real data, preventing hallucinations by providing factual context for every response.
┌─────────────────────────────────────────────────────────────────┐
│ MULTIMODAL RAG PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ INDEXING (Offline) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Text │───>│ Embed │───>│ FAISS │ │
│ │ Docs │ │ Model │ │ Index │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ │ │
│ │ Images │─────────┘ │
│ └──────────┘ │
│ │
│ RETRIEVAL (Online) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Query │───>│ Embed │───>│ Search │───>│ Rerank │ │
│ │ │ │ Query │ │ Top-K │ │ Top-N │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.1 Load the Embedding Model
The llama-nemotron-embed-vl-1b-v2 model creates semantic vector representations of both text and images. It supports three input modes:
- Text-only embedding: Standard document search
- Image-only embedding: Search over screenshots, diagrams, slides
- Image+Text pairs: Maximum retrieval accuracy for rich documents
The model uses different context lengths for each mode to optimize quality.
✅ Logged into HuggingFace!
Loading embedding model: nvidia/llama-nemotron-embed-vl-1b-v2...
✅ Embedding model loaded!
2.2 Load the Reranking Model
Initial retrieval casts a wide net using fast vector similarity. The reranker then performs deeper query-document interaction to surface the most relevant results.
Why rerank? Embedding-based retrieval is fast but approximate. The reranker reads each candidate document alongside the query, enabling cross-attention between them. This improves accuracy by ~6-7% on benchmarks.
The llama-nemotron-rerank-vl-1b-v2 model handles both text and image documents, using the same multimodal architecture as the embedding model.
Loading reranking model: nvidia/llama-nemotron-rerank-vl-1b-v2...
✅ Reranking model loaded!
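The rerank stage described above can be sketched independently of the model. In this minimal sketch, `overlap_score` is a toy word-overlap scorer standing in for the reranker (it is not how llama-nemotron-rerank-vl-1b-v2 scores pairs), and `rerank` is a hypothetical helper that accepts any scoring function taking a query and a document.

```python
def overlap_score(query, document):
    """Toy relevance score: fraction of query words present in the document.

    Stand-in for a reranker model, which would score each
    (query, document) pair jointly instead of counting words.
    """
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Stable sort: ties keep their first-stage retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


candidates = [
    "RAG grounds answers in retrieved documents",
    "Robots learn to walk in simulation",
    "Autonomous vehicles plan routes",
]
print(rerank("how do robots learn", candidates, top_n=1))
```

Because the scorer is passed in as a parameter, the same function could wrap a real model call once the first-stage search has produced its candidates.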
2.3 Build a Sample Knowledge Base
Let's create a small knowledge base with both text and images to demonstrate multimodal retrieval. In production, you would index your actual documents, PDFs, and images here.
Sample topics:
- NVIDIA Isaac Lab (robotics)
- Autonomous vehicles (NVIDIA DRIVE)
- Nemotron 3 Nano architecture
- Genomics research (Evo-2)
- RAG fundamentals
✅ Loaded 5 documents
- With images: 3
- Text only: 2
Creating document embeddings...
✅ Created embeddings with shape: (5, 2048)
✅ FAISS index created with 5 vectors
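To make the indexing and search steps concrete, here is a brute-force version of vector search in plain Python. It is a sketch under assumptions: the notebook's actual FAISS index type is not shown, and `FlatInnerProductIndex` is a hypothetical class. It ranks by inner product over L2-normalized vectors, which is equivalent to cosine similarity; a flat inner-product FAISS index over normalized embeddings would produce the same ranking at much larger scale.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class FlatInnerProductIndex:
    """Brute-force index: stores normalized vectors and scores by dot product."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(normalize(v) for v in vectors)

    def search(self, query, k):
        """Return up to k (score, position) pairs, best match first."""
        q = normalize(query)
        scores = [
            (sum(a * b for a, b in zip(q, v)), position)
            for position, v in enumerate(self.vectors)
        ]
        scores.sort(key=lambda pair: pair[0], reverse=True)
        return scores[:k]


index = FlatInnerProductIndex()
index.add([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy 2-D "embeddings"
print(index.search([1.0, 0.1], k=2))
```

The real embeddings in this tutorial have 2048 dimensions, but the ranking logic is the same.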
Query: How is AI used in robotics?
Top 2 results:
1. NVIDIA Isaac Lab is a unified framework for robot learning built on Isaac Sim. It provides modular c...
Has image: True
2. Autonomous vehicles use AI for perception, planning, and control. NVIDIA DRIVE provides the compute ...
Has image: True
Step 3: Add Real-Time Speech with Nemotron Speech ASR
Now we add voice input capability. The nemotron-speech-streaming-en-0.6b model converts spoken audio to text with ultra-low latency.
┌─────────────────────────────────────────────────────────────┐
│ ASR PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Audio │───>│ NeMo │───>│ Text │ │
│ │ Stream │ │ ASR │ │ Output │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ v │
│ ┌─────────────────┐ │
│ │ + Punctuation │ │
│ │ + Capitalization│ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Key features:
- Trained on 285k-hour Granary dataset
- 7.16% average WER on Open ASR Leaderboard
- Cache-aware streaming for real-time applications
- Built-in punctuation and capitalization
- Configurable latency: 80ms to 1.1s chunk sizes
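The latency setting translates directly into how many audio samples are sent to the model at a time. The arithmetic can be sketched as follows, assuming 16 kHz mono audio; `chunk_size_in_samples` and `iter_chunks` are hypothetical helpers, and this sketch does not use NeMo's cache-aware streaming API.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input


def chunk_size_in_samples(chunk_ms, sample_rate=SAMPLE_RATE_HZ):
    """Number of samples in one chunk of the given duration."""
    return sample_rate * chunk_ms // 1000


def iter_chunks(samples, chunk_ms, sample_rate=SAMPLE_RATE_HZ):
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    size = chunk_size_in_samples(chunk_ms, sample_rate)
    if size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence
chunks = list(iter_chunks(one_second, chunk_ms=80))
print(chunk_size_in_samples(80), len(chunks), len(chunks[-1]))
```

Smaller chunks lower the delay before the first words appear but give the model less audio per step, which is the trade-off behind the configurable range.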
Loading ASR model: nvidia/nemotron-speech-streaming-en-0.6b...
✅ ASR model loaded!
📝 Transcription: Could you please tell me about robotics?
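Depending on the NeMo version, the transcription call may return plain strings or `Hypothesis` objects whose `text` attribute holds the transcript. The sketch below normalizes both cases before the text is passed to the safety check; `FakeHypothesis`, `to_text`, and `first_transcript` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeHypothesis:
    """Hypothetical stand-in for a NeMo Hypothesis object."""
    text: str
    score: float = 0.0


def to_text(result):
    """Return the transcript whether the result is a string or has a .text attribute."""
    if isinstance(result, str):
        return result.strip()
    text = getattr(result, "text", None)
    if text is None:
        raise TypeError(f"cannot extract text from {type(result).__name__}")
    return text.strip()


def first_transcript(results):
    """Transcript of the first item in a batch, or an empty string for an empty batch."""
    if not results:
        return ""
    return to_text(results[0])


batch = [FakeHypothesis("Could you please tell me about robotics?", score=-465.7)]
print(first_transcript(batch))
```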
Step 4: Enforce Safety with Nemotron Content Safety and PII Detection
Production agents need guardrails. The Llama-3.1-Nemotron-Safety-Guard-8B-v3 model checks both user inputs and agent outputs for safety violations.
┌─────────────────────────────────────────────────────────────┐
│ SAFETY PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ User │────────>│ │ │
│ │ Query │ │ Safety Guard │──> Safe/Unsafe │
│ └──────────┘ │ │ │
│ │ 23 Categories: │ │
│ ┌──────────┐ │ - Violence │ │
│ │ Agent │────────>│ - PII/Privacy │ │
│ │ Response │ │ - Harassment │ │
│ └──────────┘ │ - Fraud │ │
│ │ - Malware │ │
│ │ - etc. │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Key features:
- Multilingual support (20+ languages)
- PII detection (emails, SSNs, phone numbers)
- Cultural context awareness
- 23 safety categories
- Works with noisy ASR output
Loading safety model: nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3...
✅ Safety model loaded!
🛡️ Safety Check Results:
1. Query: How does AI improve robotics?...
Result: {'User Safety': 'safe', 'Response Safety': 'safe'}
2. Query: My email is john@example.com and my SSN is 123-45-...
Result: {'User Safety': 'unsafe', 'Safety Categories': 'PII/Privacy'}
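The result dictionaries above can drive a simple allow/deny gate. The sketch below assumes the safety model's raw text contains a JSON object with the keys shown in these results; `parse_safety_output` and `is_safe` are hypothetical helpers. Anything that cannot be parsed is treated as unsafe, so a malformed response fails closed.

```python
import json

# Verdict used whenever the model output cannot be parsed (fail closed).
UNPARSEABLE = {"User Safety": "unsafe", "Safety Categories": "unparseable output"}


def parse_safety_output(raw_text):
    """Extract the JSON verdict from the model's raw text."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return dict(UNPARSEABLE)
    try:
        return json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return dict(UNPARSEABLE)


def is_safe(result, field="User Safety"):
    """True only when the requested field is present and equals 'safe'."""
    return str(result.get(field, "unsafe")).lower() == "safe"


verdict = parse_safety_output('{"User Safety": "unsafe", "Safety Categories": "PII/Privacy"}')
print(is_safe(verdict), verdict.get("Safety Categories"))
```

Checking `"Response Safety"` with the same function covers the output side of the pipeline.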
Step 5: Add Long-Context Reasoning with Nemotron 3 Nano
With retrieval, speech, and safety in place, we add the reasoning engine. Nemotron 3 Nano processes the retrieved context and generates intelligent responses.
┌─────────────────────────────────────────────────────────────┐
│ REASONING PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ │
│ │ Retrieved│─────┐ │
│ │ Docs │ │ │
│ └──────────┘ │ ┌────────────────┐ │
│ ├───>│ │ │
│ ┌──────────┐ │ │ Nemotron 3 │ ┌──────────┐ │
│ │ Image │─────┤ │ Nano │───>│ Response │ │
│ │ Descs │ │ │ │ └──────────┘ │
│ └──────────┘ │ │ 1M tokens │ │
│ │ │ Mamba+Trans │ │
│ ┌──────────┐ │ └────────────────┘ │
│ │ User │─────┘ │ │
│ │ Query │ ┌──────┴──────┐ │
│ └──────────┘ │ Optional │ │
│ │ Thinking │ │
│ │ Mode │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Architecture highlights:
- 1M token context: Fit entire document collections in a single request
- Mamba-Transformer hybrid: Efficient inference on long sequences
- Thinking mode: Optional step-by-step reasoning for complex queries
For images in retrieved documents, we first use Nemotron Nano VL to describe them, then include those descriptions in the context for Nemotron 3 Nano.
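The describe-then-include step can be sketched as plain prompt assembly. This is a minimal sketch under assumptions: the document layout (a dict with `text` and an optional `image` key), the prompt wording, and the helpers `build_context` and `build_prompt` are all hypothetical, and `describe_image` stands in for the call to the vision-language model.

```python
def build_context(documents, describe_image):
    """Join retrieved documents into one context string.

    describe_image is any callable that turns an image reference into text;
    in the tutorial's pipeline that role is played by the vision-language model.
    """
    sections = []
    for number, doc in enumerate(documents, start=1):
        section = f"[Document {number}]\n{doc['text']}"
        image = doc.get("image")
        if image is not None:
            section += f"\n[Image description] {describe_image(image)}"
        sections.append(section)
    return "\n\n".join(sections)


def build_prompt(query, documents, describe_image):
    """Combine context and user query into a single grounded prompt."""
    context = build_context(documents, describe_image)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


docs = [
    {"text": "Isaac Lab is a framework for robot learning.", "image": "isaac.png"},
    {"text": "RAG grounds answers in retrieved data."},
]
print(build_prompt("What is Isaac Lab?", docs, lambda img: f"a diagram from {img}"))
```

Only documents that carry an image trigger a description call, which keeps the extra model calls proportional to the number of retrieved images.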
✅ Nemotron LLM initialized!
🤖 Response: Isaac Lab is used as a unified framework for robot learning that enables the development and testing of AI‑driven behaviors—such as locomotion, manipulation, and navigation—through modular components built on NVIDIA Isaac Sim.
Step 6: Build a Voice-Powered LangChain v1.0 Agent with RAG
Now we'll create a LangChain v1.0 agent using the modern create_agent API. The agent automatically loops, calling the RAG tool as needed until it can answer your voice query.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LANGCHAIN v1.0 AGENT PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 🎤 Voice Input (Microphone) │
│ │ │
│ v │
│ ASR (Nemotron) → Text Query │
│ │ │
│ v │
│ 🛡️ STEP 2: Safety Check (Input) ─────────────────────> ❌ REJECT │
│ │ │
│ v (if safe) │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ STEP 3: create_agent Loop (CompiledStateGraph) │ │
│ │ │ │
│ │ Agent (ChatNVIDIA) ──> Decides: Need RAG? │ │
│ │ │ │ │ │
│ │ NO YES │ │
│ │ │ │ │ │
│ │ │ v │ │
│ │ │ RAG Tool ───────────────────────┐ │ │
│ │ │ │ │ │ │
│ │ │ v │ │ │
│ │ │ Has Images? ──YES──> VLM Describe │ │ │
│ │ │ │ │ │ │ │
│ │ │ NO v │ │ │
│ │ │ │ Add to Context │ │ │
│ │ │ └──────────────────────┬────────┘ │ │
│ │ │ │ │ │
│ │ v v │ │
│ │ Generate Response <────────────────────┘ │ │
│ │ │ │
│ │ (Loop continues until agent is satisfied) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ 🛡️ STEP 4: Safety Check (Output) ─────────────────────> ❌ FILTER │
│ │ │
│ v (if safe) │
│ ✅ STEP 5: Return Safe Response │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Features:
- Uses langchain.agents.create_agent (LangChain v1.0 API)
- Returns CompiledStateGraph that auto-loops until complete
- Integrates with ChatNVIDIA for NVIDIA API endpoints
- Uses VLM when images are retrieved in context
- Safety ALWAYS enforced on input AND output
First, define the RAG tool:
✅ RAG tool defined!
Function: search_knowledge_base
Capabilities: Multimodal retrieval + reranking + image description
✅ LangChain v1.0 Agent created!
API: langchain.agents.create_agent (LangChain v1.0)
Model: nvidia/nemotron-3-nano-30b-a3b via ChatNVIDIA
Tools: search_knowledge_base
Returns: CompiledStateGraph (auto-loops until complete)
✅ Voice RAG Agent initialized!
Pipeline Flow:
1. 🎤 Voice Input → Nemotron ASR → Text
2. 🛡️ Safety Check (input) → Reject if unsafe
3. 🤖 Agent Loop (LangChain v1.0 create_agent)
└─ RAG Tool → Embed + Rerank + VLM (if images)
4. 🛡️ Safety Check (output) → Filter if unsafe
5. ✅ Return safe response
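The five steps above can be sketched as one orchestration function with every model injected as a callable. This is an illustrative sketch, not the notebook's implementation: `run_voice_pipeline`, the two message constants, and the shape of the returned dict are hypothetical.

```python
REJECT_MESSAGE = "Sorry, I can't help with that request."
FILTER_MESSAGE = "The response was withheld by the safety filter."


def run_voice_pipeline(audio, transcribe, input_is_safe, run_agent, output_is_safe):
    """Run ASR, input safety, the agent, and output safety in order.

    Each stage is a callable, so real models or test stubs can be plugged in.
    """
    query = transcribe(audio)                # step 1: voice -> text
    if not input_is_safe(query):             # step 2: reject before the agent runs
        return {"query": query, "response": REJECT_MESSAGE, "status": "rejected"}
    draft = run_agent(query)                 # step 3: agent loop with RAG tool
    if not output_is_safe(query, draft):     # step 4: filter unsafe output
        return {"query": query, "response": FILTER_MESSAGE, "status": "filtered"}
    return {"query": query, "response": draft, "status": "ok"}  # step 5


result = run_voice_pipeline(
    audio=b"",
    transcribe=lambda _: "Tell me about robotics",
    input_is_safe=lambda q: True,
    run_agent=lambda q: "Isaac Lab is a robot learning framework.",
    output_is_safe=lambda q, r: True,
)
print(result["status"], result["response"])
```

Returning early on an unsafe input means the agent, and therefore the retrieval and VLM calls, never run for rejected queries.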
Interactive Voice Interface
Now let's create an interactive microphone recorder to query the agent with your voice!
📁 OPTION 1: Upload Audio File (Recommended for SSH/Remote)
Upload a .wav, .mp3, .flac, .webm, or .ogg file from your local machine.
🎙️ OPTION 2: Browser Microphone (Local Jupyter Only)
⚠️ This requires localhost or HTTPS. Will show 'Permission denied' over SSH.
▶️ PROCESS AUDIO
Summary
You've built a voice-powered LangChain v1.0 agent with RAG using NVIDIA Nemotron models:
┌─────────────────────────────────────────────────────────────────────────────┐
│ VOICE RAG AGENT STACK (LangChain v1.0) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Component Model Purpose │
│ ───────── ───── ─────── │
│ 🎤 ASR nemotron-speech-streaming Voice → Text │
│ 📚 Embeddings llama-nemotron-embed-vl Semantic search │
│ 🔄 Reranking llama-nemotron-rerank-vl Sharpen accuracy │
│ 🖼️ Vision (VLM) nemotron-nano-12b-vl Describe images │
│ 🤖 Agent LLM nemotron-3-nano-30b Agent reasoning │
│ 🛡️ Safety Llama-3.1-Nemotron-Safety Input/Output checks │
│ │
│ Architecture: LangChain v1.0 Agent (CompiledStateGraph) │
│ API: langchain.agents.create_agent │
│ Model: ChatNVIDIA (langchain_nvidia_ai_endpoints) │
│ Tools: - search_knowledge_base (RAG + VLM on-demand) │
│ Input: 🎙️ Voice only (microphone) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The 5-Step Pipeline
┌───────────────────────────────────────────────────────────────┐
│ STEP 1: Voice Input │
│ 🎤 Microphone → Nemotron ASR → Text Query │
└───────────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────────┐
│ STEP 2: Input Safety Check │
│ 🛡️ Nemotron Safety Guard validates input │
│ ├── UNSAFE → ❌ Reject immediately │
│ └── SAFE → ✅ Continue to agent │
└───────────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────────┐
│ STEP 3: Agent Processing (create_agent loop) │
│ 🤖 Agent decides: Need more information? │
│ ├── YES → Call RAG Tool │
│ │ └── Retrieve + Rerank docs │
│ │ └── If images → VLM describes them │
│ │ └── Return context to agent │
│ │ └── Loop back to decide again │
│ └── NO → Generate final response │
└───────────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────────┐
│ STEP 4: Output Safety Check │
│ 🛡️ Nemotron Safety Guard validates response │
│ ├── UNSAFE → ❌ Filter/block response │
│ └── SAFE → ✅ Continue to output │
└───────────────────────────────────────────────────────────────┘
│
v
┌───────────────────────────────────────────────────────────────┐
│ STEP 5: Return Safe Response │
│ ✅ Deliver grounded, safe response to user │
└───────────────────────────────────────────────────────────────┘
LangChain v1.0 Implementation Details
Modern create_agent API:
- ✅ Uses langchain.agents.create_agent (LangChain v1.0)
- ✅ Returns CompiledStateGraph that auto-loops
- ✅ Integrates with ChatNVIDIA for NVIDIA endpoints
- ✅ Simple message format: {"messages": [("user", query)]}
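The message format can be wrapped in two small helpers: one that builds the agent input and one that reads the final reply back out. This sketch assumes the agent returns a state dict whose `messages` list ends with the final answer, either as a `(role, text)` tuple or as an object with a `content` attribute holding a string; `make_agent_input`, `final_answer`, and `FakeMessage` are hypothetical names.

```python
def make_agent_input(query):
    """Wrap a text query in the message format shown above."""
    return {"messages": [("user", query)]}


def final_answer(result):
    """Read the last message from an agent result (assumed state shape)."""
    messages = result.get("messages", [])
    if not messages:
        raise ValueError("agent result contains no messages")
    last = messages[-1]
    if isinstance(last, (tuple, list)):
        return str(last[1])                    # (role, text) pair
    return str(getattr(last, "content", last))  # message object with .content


class FakeMessage:
    """Hypothetical stand-in for a chat message object."""

    def __init__(self, content):
        self.content = content


state = {"messages": [("user", "hi"), FakeMessage("Hello there")]}
print(make_agent_input("How is AI used in robotics?"))
print(final_answer(state))
```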
Voice-First Design:
- 🎤 HTML5 MediaRecorder for browser microphone access
- 🎙️ No webcam - audio only
- 📝 Automatic transcription with Nemotron ASR
Safety & Quality:
- 🛡️ Input safety check (before agent runs)
- 🛡️ Output safety check (after agent completes)
- 🖼️ VLM processes images when retrieved
- 🔍 Multimodal RAG (text + images)
- 🎯 Reranking for accuracy (+6-7%)
Next Steps
- Add more tools: Web search, calculators, code execution
- Enable streaming: Use agent.stream() for real-time responses
- Add memory: Conversation history across multiple queries
- Deploy: Use NVIDIA NIM for production inference
- Custom data: Index your own documents and images
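As one possible starting point for the memory item, a bounded buffer can replay recent turns in the same message format the agent already accepts. This is a sketch under assumptions, not a LangChain feature: `ConversationMemory` is a hypothetical class, and the `"assistant"` role name for earlier replies is an assumption.

```python
from collections import deque


class ConversationMemory:
    """Keep the most recent turns and replay them ahead of each new query."""

    def __init__(self, max_turns=5):
        # deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user_text, assistant_text):
        self._turns.append((user_text, assistant_text))

    def as_messages(self, new_query):
        """Build agent input: prior turns in order, then the new user query."""
        messages = []
        for user_text, assistant_text in self._turns:
            messages.append(("user", user_text))
            messages.append(("assistant", assistant_text))
        messages.append(("user", new_query))
        return {"messages": messages}


memory = ConversationMemory(max_turns=2)
memory.add_turn("What is Isaac Lab?", "A framework for robot learning.")
print(memory.as_messages("What is it built on?"))
```

Capping the number of turns keeps the prompt size predictable; in this pipeline, replayed turns would still pass through the input safety check along with the new query.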