Agentic RAG With NeMo Retriever NIM
Agentic RAG pipeline with NeMo Retriever and NIM for LLMs
Overview
Retrieval-augmented generation (RAG) has proven to be an effective strategy for ensuring large language model (LLM) responses are up to date and grounded rather than hallucinated.
Various retrieval strategies have been proposed to improve the recall of documents for generation, but there is no one-size-fits-all approach. The right strategy (for example: chunk size, number of documents returned, semantic search vs. graph retrieval) depends on your data. Although retrieval strategies might differ, an agentic framework designed on top of your retrieval system, one that does reasoning, decision-making, and reflection on your retrieved data, is becoming more common in modern RAG systems. An agent can be described as a system that uses an LLM to reason through a problem, create a plan to solve it, and execute the plan with the help of a set of tools. For example, LLMs are notoriously bad at solving math problems. Giving an LLM a calculator "tool" that it can use while reasoning through a larger problem, such as calculating the YoY increase of a company's revenue, is an example of an agentic workflow.
As generative AI systems start transitioning towards entities capable of performing "agentic" tasks, we need robust models that have been trained to break down tasks, act as central planners, and perform multi-step reasoning, with model- and system-level safety checks. With the Llama 3.1 family, Meta is launching a suite of LLMs spanning 8B, 70B, and 405B parameters with these tool-calling capabilities for agentic workloads. NVIDIA has partnered with Meta to make sure the latest Llama models can be deployed optimally through NVIDIA NIM microservices.
Further, with the general availability of the NVIDIA NeMo Retriever collection of NIM microservices, enterprises have access to scalable software to customize their data-dependent RAG pipelines. The NeMo Retriever NIM can be easily plugged into existing RAG pipelines and interfaces with open source LLM frameworks like LangChain or LlamaIndex, so you can easily integrate retriever models into generative AI applications.
Set Up the Environment
First, let's install a few packages for interfacing with NVIDIA embedding, reranking, and LLM models, and with vector databases.
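A minimal install sketch for a notebook environment; the exact package list is an assumption based on the components used below (LangChain, the NVIDIA connectors, LangGraph, FAISS, BM25, and PDF parsing), so adjust versions to your environment:

```python
# Assumed dependency set for this notebook; pin versions as needed
%pip install langchain langchain-community langchain-nvidia-ai-endpoints langgraph faiss-cpu rank_bm25 pypdf "unstructured[pdf]"
```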
Install the following system dependencies if they are not already available on your system (e.g., with brew install on macOS). Depending on what document types you're parsing, you may not need all of these.
- poppler-utils (images and PDFs)
- tesseract-ocr (images and PDFs)
NeMo Retriever NIM
NeMo Retriever microservices can be used for embedding and reranking. These microservices can be deployed within the enterprise locally, and are packaged together with NVIDIA Triton Inference Server and NVIDIA TensorRT for optimized inference of text for embedding and reranking. Additional enterprise benefits include:
- Scalable deployment: Whether you're catering to a few users or millions, NeMo Retriever embedding and reranking microservices can be scaled seamlessly to meet your demands.
- Flexible integration: Easily incorporate NeMo Retriever embedding and reranking microservices into existing workflows and applications, thanks to the OpenAI-compliant API endpoints, and deploy anywhere your data resides.
- Secure processing: Your data privacy is paramount. NeMo Retriever embedding and reranking microservices ensure that all inferences are processed securely, with rigorous data protection.
NeMo Retriever embedding and reranking NIM microservices are available today. Developers can download and deploy the Docker containers locally.
Access the Llama 3.1 405B model
The new Llama 3.1 set of models can be seen as the first big push of open-source models toward serious agentic capabilities. These models can now become part of a larger automation system, with LLMs doing the planning and picking the right tools to solve a larger problem. Since the NVIDIA Llama 3.1 NIM supports OpenAI-style tool calling, libraries like LangChain can now be used with NIM microservices to bind LLMs to Pydantic classes and fill in objects/dictionaries. This combination makes it easier for developers to get structured outputs from NIM LLMs without having to resort to regex parsing. You can access Llama 3.1 405B at ai.nvidia.com. Follow these instructions to generate the API key.
Architecture
Retrieving passages or documents within a RAG pipeline without further validation and self-reflection often results in unhelpful responses and factual inaccuracies. Additionally, since the models aren't explicitly trained to follow facts from passages, post-generation verification is necessary.
Multi-agent frameworks, like LangGraph, enable developers to group LLM application-level logic into nodes and edges, for finer levels of control over agentic decision-making. LangGraph with NVIDIA LangChain OSS connectors can be used for embedding, reranking, and implementing the necessary agentic RAG techniques with LLMs (as discussed previously).
To implement this, an application developer must add finer-grained decision-making on top of their RAG pipeline. The figure below shows one of many possible renditions of a router node, depending on the use case. Here, the router decides to rewrite the query with the help of an LLM, in the hope of better recall from the retriever.

- Query decomposer: Breaks down the question into multiple smaller logical questions, and is helpful when a question needs to be answered using chunks from multiple documents.
- Router: Decides if chunks need to be retrieved from the local retriever to answer the given question based on the relevancy of documents stored locally. Alternatively, the agent can be programmed to do a web search or simply answer with an ‘I don't know.’
- Retriever: The internal implementation of the RAG pipeline; for example, a hybrid of semantic search and keyword search retrievers.
- Grader: Checks if the retrieved passages/chunks are relevant to the question at hand.
- Hallucination checker: Checks if the LLM generation from each chunk is relevant to the chunk. Post-generation verification is necessary since the models are not explicitly trained to follow facts from passages.
Download the dataset
Let's download the NIH clinical studies dataset from the Docugami repository. It contains PDF documents of NIH clinical studies.
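A sketch of one way to pull the files, assuming the dataset lives in Docugami's KG-RAG-datasets GitHub repository; the directory path inside the repo is hypothetical, so check the repository layout before running:

```python
# Clone the Docugami datasets repo and copy the NIH clinical studies PDFs
# into a local working directory (repo path below is an assumption)
import os
import shutil
import subprocess

repo_url = "https://github.com/docugami/KG-RAG-datasets.git"
subprocess.run(["git", "clone", "--depth", "1", repo_url], check=True)

src = "KG-RAG-datasets/nih-clinical-trials/data/v1/docs"  # hypothetical path
dst = "nih_clinical_studies"
os.makedirs(dst, exist_ok=True)
for fname in os.listdir(src):
    if fname.endswith(".pdf"):
        shutil.copy(os.path.join(src, fname), dst)
```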
Step-1: Load and chunk the dataset
Use LangChain document loaders to load all the PDF files in the created directory and split them into chunks of 500 characters each.
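A minimal loading-and-chunking sketch using standard LangChain components; the chunk_overlap value and the directory name are assumptions:

```python
# Load every PDF in the dataset directory and split into ~500-character chunks
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFDirectoryLoader("nih_clinical_studies")  # assumed directory
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"Loaded {len(docs)} pages, produced {len(chunks)} chunks")
```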
Step-2: Initialize the Embedding, Reranking, and LLM connectors
Embedding and Reranking NIM
Use the NVIDIA OSS connectors for LangChain to initialize the embedding, reranking, and LLM models, after setting up the embedding and reranking microservices locally using the instructions here and here. Point the base_url below to the IP address of your local machine.
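A sketch of the connector setup; the model names and ports are assumptions, so substitute the values from your own NIM deployment:

```python
# Point base_url at your locally deployed NeMo Retriever NIM endpoints
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, NVIDIARerank

embeddings = NVIDIAEmbeddings(
    model="nvidia/nv-embedqa-e5-v5",          # assumed embedding NIM model
    base_url="http://localhost:8000/v1",      # assumed port
)
reranker = NVIDIARerank(
    model="nvidia/nv-rerankqa-mistral-4b-v3", # assumed reranking NIM model
    base_url="http://localhost:8001/v1",      # assumed port
)
```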
Llama 3.1 405B LLM
The latest Llama 3.1 405B model is hosted at ai.nvidia.com. Use the instructions here to obtain the API key for access.
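A minimal sketch of the hosted LLM connector, assuming the meta/llama-3.1-405b-instruct model identifier from the NVIDIA API catalog:

```python
# Hosted Llama 3.1 405B via the NVIDIA API catalog
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

os.environ["NVIDIA_API_KEY"] = "nvapi-..."  # replace with your generated key
llm = ChatNVIDIA(model="meta/llama-3.1-405b-instruct", temperature=0)
```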
Step-3: Create a hybrid search retriever
Load the documents into a keyword search store and a semantic search FAISS vector database. We create a weighted hybrid of keyword and semantic search for better retrieval recall, and give a higher weight to the keyword search retriever because of the domain-specific medical jargon.
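One possible implementation using LangChain's EnsembleRetriever; the 0.7/0.3 weights and the k values are assumptions to tune on your data:

```python
# Weighted ensemble of BM25 (keyword) and FAISS (semantic) retrieval
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain.retrievers import EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

faiss_store = FAISS.from_documents(chunks, embeddings)
faiss_retriever = faiss_store.as_retriever(search_kwargs={"k": 5})

# Higher weight on BM25 for the domain-specific medical jargon (assumed split)
retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.7, 0.3],
)
```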
Step-4: Query decomposition with structured generation
As discussed above, the NVIDIA Llama 3.1 NIM supports OpenAI-style tool calling, so libraries like LangChain can bind LLMs to Pydantic classes and fill in objects/dictionaries, yielding structured outputs without regex parsing. Here we use the Llama 3.1 NIM tool-calling capability to split the initial query into sub-queries.
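A sketch of the decomposer; the SubQueries schema and the prompt wording are illustrative assumptions:

```python
# Use tool calling to decompose the user question into sub-queries
from typing import List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate

class SubQueries(BaseModel):
    """Sub-questions that together answer the original question."""
    sub_queries: List[str] = Field(description="Smaller, self-contained questions")

decompose_prompt = ChatPromptTemplate.from_messages([
    ("system", "Break the user question into at most three smaller, "
               "logically independent sub-questions."),
    ("human", "{question}"),
])
query_decomposer = decompose_prompt | llm.with_structured_output(SubQueries)

# Usage: query_decomposer.invoke({"question": "..."}).sub_queries
```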
Step-5: Create a simple RAG chain with hybrid retriever
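A minimal sketch of the generation chain over the hybrid retriever; the prompt text and the format_docs helper are illustrative:

```python
# Simple RAG chain: stuff retrieved chunks into the prompt and generate
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

rag_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = rag_prompt | llm | StrOutputParser()

# Usage: retrieve, then generate
# docs = retriever.invoke(question)
# answer = rag_chain.invoke({"context": format_docs(docs), "question": question})
```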
Step-6: Create a retrieval grader with structured generation
Checks if the retrieved passages/chunks are relevant to the question at hand.
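A possible grader, reusing the imports from the sketches above; the GradeDocuments schema and prompt are illustrative:

```python
# Binary relevance grader for retrieved chunks
class GradeDocuments(BaseModel):
    """Relevance check of a retrieved chunk against the question."""
    binary_score: str = Field(description="'yes' if the chunk is relevant, else 'no'")

grade_prompt = ChatPromptTemplate.from_messages([
    ("system", "Grade whether the retrieved chunk is relevant to the question. "
               "Answer 'yes' or 'no'."),
    ("human", "Chunk:\n{document}\n\nQuestion: {question}"),
])
retrieval_grader = grade_prompt | llm.with_structured_output(GradeDocuments)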
Step-7: Create a hallucination checker with structured generation
Checks if the LLM generation from each chunk is relevant to the chunk. Post-generation verification is necessary since the models are not explicitly trained to follow facts from passages.
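A sketch following the same structured-generation pattern; schema and prompt wording are illustrative:

```python
# Checks that the generation is grounded in the retrieved chunks
class GradeHallucinations(BaseModel):
    """Grounding check of a generation against the retrieved facts."""
    binary_score: str = Field(description="'yes' if grounded in the facts, else 'no'")

hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system", "Determine whether the answer is grounded in the given facts. "
               "Answer 'yes' or 'no'."),
    ("human", "Facts:\n{documents}\n\nAnswer: {generation}"),
])
hallucination_checker = hallucination_prompt | llm.with_structured_output(GradeHallucinations)
```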
Step-8: Create an answer grader with structured generation
Checks if the final answer resolves the supplied question.
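Another grader in the same pattern; the GradeAnswer schema and prompt are illustrative:

```python
# Checks that the final answer actually resolves the question
class GradeAnswer(BaseModel):
    """Does the answer resolve the question?"""
    binary_score: str = Field(description="'yes' if the answer resolves the question, else 'no'")

answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Determine whether the answer resolves the question. "
               "Answer 'yes' or 'no'."),
    ("human", "Question: {question}\n\nAnswer: {generation}"),
])
answer_grader = answer_prompt | llm.with_structured_output(GradeAnswer)
```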
Step-9: Question rewriting
If none of the retrieved documents are relevant to the given question, we ask the LLM to rewrite the question for easier retrieval.
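A minimal rewriter sketch; the prompt wording is illustrative:

```python
# Rewrites the question into a form better suited to the retriever
rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the question to be clearer and better optimized "
               "for document retrieval. Return only the rewritten question."),
    ("human", "{question}"),
])
question_rewriter = rewrite_prompt | llm | StrOutputParser()
```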
Step-10: LangGraph setup
Capture the flow as a graph. Define the graph state, a data structure shared among the nodes of the graph; each node modifies the graph state depending on its function.
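A sketch of the shared state as a TypedDict, which is the usual LangGraph pattern; the field names are assumptions matching the steps above:

```python
# Shared state passed between graph nodes
from typing import List
from typing_extensions import TypedDict

class GraphState(TypedDict):
    question: str          # current (possibly rewritten) question
    sub_queries: List[str] # output of the query decomposer
    documents: List        # retrieved (and later graded) chunks
    generation: str        # current answer draft
```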
Step-11: Define the nodes as functions
Using the LangChain constructs defined above for query decomposition, grading, retrieval, hallucination checking, and so on, we can write functions that act as nodes for the multi-agent graph.
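Three representative node sketches, reusing the chains defined above; each node takes the state and returns only the fields it updates:

```python
def retrieve(state: GraphState) -> dict:
    # Pull candidate chunks from the hybrid retriever
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def grade_documents(state: GraphState) -> dict:
    # Keep only chunks the grader marks as relevant
    relevant = [
        d for d in state["documents"]
        if retrieval_grader.invoke(
            {"document": d.page_content, "question": state["question"]}
        ).binary_score == "yes"
    ]
    return {"documents": relevant}

def generate(state: GraphState) -> dict:
    # Generate an answer from the surviving chunks
    answer = rag_chain.invoke(
        {"context": format_docs(state["documents"]), "question": state["question"]}
    )
    return {"generation": answer}
```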
Step-12: Define graph edges
The nodes defined above are connected to each other through functional edges, defined programmatically. Based on the graph state, an edge may pass the state information to one of several different nodes.
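A sketch of one functional edge: it routes to question rewriting when no relevant documents survived grading, otherwise on to generation. The node names are illustrative:

```python
def decide_to_generate(state: GraphState) -> str:
    # No relevant chunks left: rewrite the query; otherwise generate
    return "transform_query" if not state["documents"] else "generate"
```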
Step-13: Build the graph
We define the rules for how the nodes connect to each other. We also use conditional edges, which can route to different nodes based on the output of a functional edge.
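A wiring sketch under the assumptions above; the transform_query node wraps the question rewriter, and the node names mirror the functions from the previous steps:

```python
from langgraph.graph import StateGraph, END

def transform_query(state: GraphState) -> dict:
    # Rewrite the question and loop back to retrieval
    return {"question": question_rewriter.invoke({"question": state["question"]})}

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("transform_query", transform_query)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {"transform_query": "transform_query", "generate": "generate"},
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()
# Usage: result = app.invoke({"question": "..."})
```

A production graph would also wire the hallucination checker and answer grader after the generate node, looping back on failure, following the same conditional-edge pattern.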