RAG with LangChain and a Locally Hosted NIM
Build a RAG using a locally hosted NIM
This notebook demonstrates how to build a RAG using NVIDIA NIM microservices. We locally host a Llama3-8b-instruct model using NVIDIA NIM for LLMs and connect to it using the LangChain NVIDIA AI Endpoints package.
We then create a vector store by downloading web pages, generating their embeddings, and storing them in a FAISS index, and showcase two different chat chains for querying the vector store. For this example, we use the NVIDIA Triton documentation website, though the code can be easily modified to use any other source. For the embedding model, we use the GPU-accelerated NV-Embed-QA model from the NVIDIA API Catalog.
The first stage is to load the NVIDIA Triton documentation from the web, chunk the data, generate embeddings, and store them in FAISS.
To get started:
- Generate an NGC CLI API key. This key will need to be passed to docker run in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.
- Download and install the NGC CLI following the NGC Setup steps. Follow the steps on that page to set the NGC CLI and docker client configs appropriately.
Note: In order to run this notebook, you need to launch the NIM Docker container from a terminal outside of the web browser notebook environment. Run the commands in the first three cells from a terminal, then begin with the fourth cell (the curl inference command) in the notebook environment (web browser).
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry using the following command from your terminal.
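A minimal sketch of the login step, following the standard NGC flow: the registry username is the literal string `$oauthtoken`, and the password is the NGC API key generated above.

```bash
# Log in to the NVIDIA Container Registry; when prompted,
# enter $oauthtoken as the username and your NGC API key as the password
docker login nvcr.io
```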
Set up a location for caching the model artifacts and export the following environment variables from your terminal.
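For example (the cache path `~/.cache/nim` is an illustrative choice; any writable directory works):

```bash
# NGC API key generated earlier; passed into the container so it can pull model artifacts
export NGC_API_KEY=<your NGC API key>

# Host directory where the NIM caches downloaded model artifacts
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
```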
Launch the NIM LLM microservice by executing this command from the terminal where you have exported all the environment variables.
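A sketch of the launch command, assuming the Llama3-8b-instruct NIM image from NGC; the exact image tag is an assumption, so check NGC for the latest version.

```bash
# NIM container image for Llama3-8b-instruct (tag is illustrative)
export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    $IMG_NAME
```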
Before we continue and connect the NIM to LangChain, let's test it using a simple OpenAI completion request. You can execute this command, and all subsequent ones, from the notebook in your web browser.
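Something along these lines should work against the NIM's OpenAI-compatible completions endpoint; the port and model name follow the `docker run` command above, and the prompt is just an example. Prefix the command with `!` to run it from a notebook cell.

```bash
curl -X POST 'http://0.0.0.0:8000/v1/completions' \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta/llama3-8b-instruct",
          "prompt": "Once upon a time",
          "max_tokens": 64
        }'
```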
Now set up the LangChain flow by installing the prerequisite libraries.
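An illustrative install cell; the exact package set is an assumption (faiss-cpu is used here for simplicity, and requests/beautifulsoup4 support the HTML loading helpers later on).

```python
!pip install langchain langchain-community langchain-nvidia-ai-endpoints langchain-text-splitters faiss-cpu beautifulsoup4 requests
```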
Set up your NVIDIA API key, which you can get from the API Catalog. This key is used to communicate with the GPU-accelerated, cloud-hosted embedding model.
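A minimal sketch that prompts for the key if it is not already set; API Catalog keys start with `nvapi-`.

```python
import getpass
import os

# Prompt for the API Catalog key (nvapi-...) if it is not already in the environment
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")
```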
We can now connect to the deployed NIM LLM from LangChain by specifying its base URL.
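A sketch using ChatNVIDIA from the LangChain NVIDIA AI Endpoints package; the base URL and model name assume the container launched above, and the sampling parameters are illustrative.

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Point the LangChain chat model at the locally hosted NIM instead of the API Catalog
llm = ChatNVIDIA(
    base_url="http://0.0.0.0:8000/v1",
    model="meta/llama3-8b-instruct",
    temperature=0.1,
    max_tokens=1000,
)
```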
Import all the required libraries for building the LangChain chains.
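A sketch of the imports used in the rest of this notebook, assuming the packages installed above.

```python
import re

import requests
from bs4 import BeautifulSoup

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
```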
Define helper functions for loading HTML files, which we'll use to generate the embeddings. We'll use these later to load the relevant HTML documents from the Triton documentation website and convert them to a vector store.
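A minimal sketch of such a helper, assuming requests and BeautifulSoup; the function name `html_document_loader` is an illustrative choice.

```python
def html_document_loader(url: str) -> str:
    """Fetch a web page and return its visible text with markup stripped."""
    try:
        response = requests.get(url, timeout=30)
        html_content = response.text
    except Exception as exc:
        print(f"Failed to load {url}: {exc}")
        return ""

    soup = BeautifulSoup(html_content, "html.parser")
    # Drop script/style tags so only the readable documentation text remains
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()
```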
Read the HTML files and split the text in preparation for embedding generation. Note that the chunk_size value must match the specific model used for embedding generation.
Make sure to pay attention to the chunk_size parameter in TextSplitter. Setting the right chunk size is critical for RAG performance, as much of a RAG's success is based on the retrieval step finding the right context for generation. The entire prompt (retrieved chunks + user query) must fit within the LLM's context window. Therefore, avoid chunk sizes that are too large, and balance them against the estimated query size. For example, while OpenAI LLMs have context windows of 8k-32k tokens, Llama3 is limited to 8k tokens. Experiment with different chunk sizes; typical values are 100-600 tokens, depending on the LLM.
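A sketch of the loading and splitting step; the URL list, chunk_size, and chunk_overlap are illustrative values to be tuned per the guidance above.

```python
# Pages from the Triton documentation to index (list is illustrative)
urls = [
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
]
raw_texts = [html_document_loader(url) for url in urls]

# Example splitter settings; tune chunk_size/chunk_overlap for your embedding model and LLM
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
)
documents = text_splitter.create_documents(raw_texts)
```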
Generate embeddings using the NVIDIA Retrieval QA Embedding NIM and NVIDIA AI Endpoints for LangChain, and save the embeddings to an offline vector store in the /embed directory for future re-use.
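A minimal sketch of this step: embed the chunks with NV-Embed-QA via NVIDIAEmbeddings, build a FAISS index, and persist it. The path `./embed` assumes the embed directory mentioned above lives alongside the notebook.

```python
# NVIDIA Retrieval QA embedding model (NV-Embed-QA), called through the API Catalog
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA")

# Embed the chunks, build the FAISS index, and persist it for re-use
vectorstore = FAISS.from_documents(documents, embedding_model)
vectorstore.save_local("./embed")
```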
The second stage is to load the embeddings from the vector store and build a RAG using NVIDIAEmbeddings.
Create the embeddings model using the NVIDIA Retrieval QA Embedding NIM from the API Catalog. This model represents words, phrases, or other entities as vectors of numbers and understands the relationship between words and phrases. See here for reference: https://build.nvidia.com/nvidia/embed-qa-4
Load the documents from the vector store using FAISS.
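A sketch of both steps: recreate the embedding model and reload the FAISS index persisted in the first stage. The `allow_dangerous_deserialization` flag is assumed to be needed by recent langchain_community releases when loading a locally created index.

```python
# Recreate the NV-Embed-QA embedding model and reload the persisted FAISS index
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA")
vectorstore = FAISS.load_local(
    "./embed",
    embedding_model,
    allow_dangerous_deserialization=True,  # required by recent langchain_community versions
)
```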
Create a ConversationalRetrievalChain using the local NIM. We'll use the Llama3 8B NIM we created and deployed locally, add memory for chat history, and connect to the vector store via the embedding model. See here for reference: https://python.langchain.com/docs/modules/chains/popular/chat_vector_db#conversationalretrievalchain-with-streaming-to-stdout
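A minimal sketch of the chain, reusing the `llm` connected to the local NIM above; the variable names `memory` and `qa_chain` are illustrative.

```python
# Conversation memory so follow-up questions can refer back to earlier answers
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Retrieval chain: the local NIM answers questions over the FAISS retriever
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)
```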
Now try asking a question about Triton with the simpler chain, and compare the answer to the result from the previous, more complex chain.
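For example, assuming the `qa_chain` built above (the question text is just a sample):

```python
# Ask a question grounded in the Triton documentation we indexed
result = qa_chain.invoke({"question": "What interfaces does Triton support?"})
print(result["answer"])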
Ask another question about Triton
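Another sample query against the same chain:

```python
result = qa_chain.invoke({"question": "But why should I use Triton?"})
print(result["answer"])
```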
Finally, showcase the chat capabilities by asking a question about the previous query.
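Because the chain carries conversation memory, a follow-up can refer back to the earlier exchange; the wording below is just an example.

```python
# The chat history lets the chain resolve references to the previous question
result = qa_chain.invoke({"question": "What was my previous question about?"})
print(result["answer"])
```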