RAG Chatbot Example

Building a Meta Llama 3 chatbot with Retrieval Augmented Generation (RAG)

This notebook shows a complete example of how to build a Meta Llama 3 chatbot, served to your browser, that can answer questions based on your own data. We'll cover:

  • The deployment process of Meta Llama 3 8B with the Text-generation-inference framework as an API server
  • A chatbot example built with Gradio and wired to the server
  • Adding RAG capability with Meta Llama 3 specific knowledge based on our Getting Started guide

RAG Architecture

LLMs have unprecedented capabilities in Natural Language Understanding (NLU) and Natural Language Generation (NLG), but they have a knowledge cutoff date and are only trained on publicly available data from before that date.

RAG, invented by Meta in 2020, is one of the most popular methods to augment LLMs. RAG allows enterprises to keep sensitive data on-prem and get more relevant answers from generic models without fine-tuning models for specific roles.

RAG is a method that:

  • Retrieves data from outside a foundation model
  • Augments your questions or prompts to LLMs by adding the retrieved relevant data as context
  • Allows LLMs to answer questions about your own data, or data not publicly available when LLMs were trained
  • Greatly reduces hallucination in the model's generated responses

The following diagram shows the general RAG components and process:

(Diagram: general RAG components and process)

How to Develop a RAG Powered Meta Llama 3 Chatbot

The easiest way to develop RAG-powered Meta Llama 3 chatbots is to use frameworks such as LangChain and LlamaIndex, two leading open-source frameworks for building LLM apps. Both offer convenient APIs for implementing RAG with Meta Llama 3 including:

  • Load and split documents
  • Embed and store document splits
  • Retrieve the relevant context based on the user query
  • Call Meta Llama 3 with query and context to generate the answer
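The four steps above can be sketched end-to-end in plain Python. This is a toy illustration only: the keyword-overlap retriever stands in for a real vector-store similarity search, and `generate()` is a stub for the actual Llama 3 call.

```python
# Toy sketch of the RAG steps: retrieve -> augment -> generate.
docs = [
    "Llama 3 comes in 8B and 70B parameter sizes.",
    "RAG augments prompts with retrieved context.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive word overlap with the query
    (a stand-in for embedding similarity search)."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def augment(query, context):
    """Prepend the retrieved context to the user question."""
    return f"Context: {' '.join(context)}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Stub for the LLM call; a real app would send the prompt to Llama 3."""
    return f"(model answer grounded in: {prompt.splitlines()[0]})"

query = "What parameter sizes does Llama 3 come in?"
context = retrieve(query, docs)
answer = generate(augment(query, context))
```

A real implementation replaces `retrieve` with a vector-store lookup and `generate` with a call to the served model, but the data flow is the same.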

LangChain is a more general-purpose, flexible framework for developing LLM apps with RAG capabilities, while LlamaIndex, as a data framework, focuses on connecting custom data sources to LLMs. Combining the two may yield the most performant and effective solution for building real-world RAG apps.
In our example, for simplicity, we will use LangChain alone with locally stored PDF data.

Install Dependencies

For this demo, we will use Gradio for the chatbot UI and the Text-generation-inference (TGI) framework for model serving.
For vector storage and similarity search, we will use FAISS.
We will run everything on an AWS EC2 instance (a g5.2xlarge, which has one A10G GPU). We recommend running this notebook with at least one GPU equivalent to an A10G, with at least 16 GB of video memory.
There are techniques to downsize the Meta Llama 3 8B model so it fits on smaller GPUs, but they are out of scope here.

First, let's install all dependencies with pip. We also recommend starting a dedicated Conda environment for better package management.

[ ]

Data Processing

First, run all the imports and define the paths for the data and for the vector storage created after processing.
For the data, we will be using a raw PDF crawled from the Meta Llama 3 Getting Started guide on the Meta AI website.
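As a sketch of this setup step, the two paths might be defined as constants like the following (the names here are assumptions for illustration, not necessarily the notebook's exact identifiers):

```python
from pathlib import Path

# Hypothetical constant names; the notebook's actual identifiers may differ.
DATA_PATH = Path("data")                      # directory holding the crawled PDF
DB_FAISS_PATH = Path("vectorstore/db_faiss")  # where the FAISS index will be saved
```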

[1]

Then we use the PyPDFDirectoryLoader to load the entire directory. You can also use PyPDFLoader to load a single file.

[2]

Check the length and content of the doc to ensure we have loaded the right document, which has 37 pages.

[3]
37 11/8/23, 2:00 PM Getting started with Llama 2 - AI at Meta
https://ai.meta.com/llama/get-started/ 1/

Split the loaded documents into smaller chunks.
RecursiveCharacterTextSplitter is one common splitter that splits long pieces of text into smaller, semantically meaningful chunks.
Other splitters include:

  • SpacyTextSplitter
  • NLTKTextSplitter
  • SentenceTransformersTokenTextSplitter
  • CharacterTextSplitter
[4]
103 page_content='11/8/23, 2:00 PM Getting started with Llama 2 - AI at Meta\nhttps://ai.meta.com/llama/get-started/ 1/37\nLlama 2 Get Started FAQ Download the Model\nQuick setup and how-to guide\nGetting started\nwith Llama\nWelcome to the getting started guide for Llama.\nThis guide provides information and resources to help you set up Llama including how to access the model,\nhosting, how-to and integration guides. Additionally , you will find supplemental materials to further assist you while\nbuilding with Llama.' metadata={'source': 'data/Llama Getting Started Guide.pdf', 'page': 0}

Note that we have set chunk_size to 500 and chunk_overlap to 10. During splitting, these two parameters directly affect the quality of the LLM's answers.
Here is a good guide on how to set these two parameters carefully.
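To see how chunk_size and chunk_overlap interact, here is a simplified sliding-window splitter. Note this is a sketch of the overlap mechanic only; RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and sentence boundaries, which this toy version omits.

```python
def naive_split(text, chunk_size=500, chunk_overlap=10):
    """Sliding-window splitter: each chunk repeats the last
    `chunk_overlap` characters of the previous chunk."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap  # step back to create the overlap
    return chunks

# 990 characters -> two chunks of 500, sharing a 10-character overlap
text = "".join(str(i % 10) for i in range(990))
chunks = naive_split(text, chunk_size=500, chunk_overlap=10)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of slightly more storage.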

Next, we need to choose an embedding model for our split documents.
Embeddings are numerical representations of text. The default embedding model in HuggingFace Embeddings is sentence-transformers/all-mpnet-base-v2, with 768 dimensions. Below we use the smaller model all-MiniLM-L6-v2, with 384 dimensions, so indexing runs faster.

[6]
modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]
config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]
sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]
config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]
1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Lastly, with the splits and the embedding model ready, we index all the split chunks and store them as embeddings in the vector store.

Vector stores are databases for storing embeddings. LangChain supports at least 60 vector stores; two of the most popular open-source ones are:

  • Chroma: light-weight and in-memory, so it's easy to get started with and use for local development.
  • FAISS (Facebook AI Similarity Search): a vector store that supports search in vectors that may not fit in RAM and is appropriate for production use.

Since we are running on an EC2 instance with abundant CPU resources and RAM, we will use FAISS in this example. Note that FAISS can also run on GPUs, where some of its most useful algorithms are implemented; in that case, install the faiss-gpu package with pip instead.
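Conceptually, an exact (flat) FAISS index just computes the distance from the query vector to every stored vector and returns the nearest ones. A pure-Python sketch with toy 2-D vectors (real embeddings from all-MiniLM-L6-v2 would be 384-dimensional):

```python
import math

# Toy 2-D "embeddings" keyed by document name.
store = {
    "doc_a": (0.0, 1.0),
    "doc_b": (1.0, 0.0),
    "doc_c": (0.9, 0.1),
}

def nearest(query_vec, store, k=2):
    """Exact L2 nearest-neighbor search, as in a flat FAISS index."""
    def dist(item):
        _, vec = item
        return math.dist(query_vec, vec)  # Euclidean distance
    return [name for name, _ in sorted(store.items(), key=dist)[:k]]

hits = nearest((1.0, 0.0), store)  # doc_b is an exact match, doc_c is close
```

FAISS's value over this brute-force loop is doing the same search efficiently over millions of high-dimensional vectors, including approximate indexes that trade a little recall for large speedups.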

[7]

Once you have saved the database to a local path, you can find the files index.faiss and index.pkl there. In the chatbot example, you can then load this database from local storage and plug it into our retrieval process.

Model Serving

In this example, we will deploy a Meta Llama 3 8B chat model from Hugging Face with the Text-generation-inference framework on-premises.
This allows us to directly wire the API server to our chatbot.
There are alternative solutions for deploying Meta Llama 3 models on-premises as your local API server; you can find our complete guide here.

In a separate terminal, run the commands below to launch an API server with TGI. This will download model artifacts, store them locally, and launch the server at the desired port on your localhost; in our case, port 8080.

[ ]

Once we have the API server up and running, we can run a simple curl command to validate our model is working as expected.

[ ]
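For reference, the same validation request can be assembled with Python's standard library. The endpoint and payload shape follow TGI's /generate API, and the port matches the one we launched on; the final urlopen call is left commented out because it requires the running server.

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080/generate"  # the port we launched TGI on

payload = {
    "inputs": "What is Llama 3?",
    "parameters": {"max_new_tokens": 64},
}
req = urllib.request.Request(
    TGI_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # returns the generated text once the server is up
```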

Building the Chatbot UI

Now we are ready to build the chatbot UI that wires up the RAG data and the API server. In our example we will use Gradio to build the chatbot UI.
Gradio is an open-source Python library used to build machine learning and data science demos and web applications. It is widely used by the community, and Hugging Face also uses Gradio to build its chatbots.

Again, we start by adding all the imports, paths, and constants, and set LangChain to debug mode so it shows the actions within the chain process clearly.

[8]

Then we load the FAISS vector store.

[11]

Now we create a TGI LLM instance and wire it to the API serving port on localhost.

[12]
/opt/conda/envs/pytorch/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `HuggingFaceTextGenInference` was deprecated in LangChain 0.0.21 and will be removed in 0.2.0. Use HuggingFaceEndpoint instead.
  warn_deprecated(
/opt/conda/envs/pytorch/lib/python3.10/site-packages/pydantic/_internal/_fields.py:127: UserWarning: Field "model_id" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(

Next, we define the retriever and the template for our RetrievalQA chain. On each call, LangChain performs a semantic similarity search for the query in the vector database, then passes the search results as context to Llama to answer the query about the data stored there.
The template defines the format in which the question, along with the context, will be sent to Llama for generation. Meta Llama 3 has a specific prompt format with special tokens. In some cases the serving framework already takes care of this; otherwise, you will need to write a customized template to handle it properly.
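For illustration, a Llama 3 instruct-style prompt embedding the retrieved context might be assembled as below. The special tokens follow the published Meta Llama 3 prompt format; check whether your serving framework already applies them before adding your own, or the tokens will be duplicated.

```python
# Llama 3 instruct prompt format with RAG context slotted into the user turn.
LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Use the following context to answer the question.\n"
    "Context: {context}\n"
    "Question: {question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

prompt = LLAMA3_TEMPLATE.format(
    context="Llama 3 comes in 8B and 70B sizes.",
    question="What sizes does Llama 3 come in?",
)
```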

[19]

Lastly, we can define the retrieval chain for QA

[20]

Now we should have a working chain for QA. Let's test it out before wiring it up with the UI blocks.

[ ]

After confirming the chain works, we can start building the UI. Before we define the Gradio blocks, let's first define the callback stream that we will use later for the streaming feature.
This callback handler will put streaming LLM responses into a queue for the Gradio UI to render on the fly.
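The idea can be sketched without LangChain: a handler pushes each new token into a queue, and a generator drains the queue until a sentinel marks the end of the stream. In the real notebook this class would subclass LangChain's streaming callback handler; here it is a minimal stand-in.

```python
import queue

END = object()  # sentinel marking the end of the stream

class QueueCallback:
    """Minimal stand-in for a LangChain streaming callback handler."""
    def __init__(self):
        self.q = queue.Queue()

    def on_llm_new_token(self, token, **kwargs):
        self.q.put(token)          # called once per generated token

    def on_llm_end(self, *args, **kwargs):
        self.q.put(END)            # signals the consumer to stop

def stream(cb):
    """Yield tokens as they arrive; Gradio would render each partial answer."""
    while True:
        token = cb.q.get()
        if token is END:
            break
        yield token

# Simulate the LLM emitting three tokens, then finishing.
cb = QueueCallback()
for t in ["Hello", ", ", "world"]:
    cb.on_llm_new_token(t)
cb.on_llm_end()
answer = "".join(stream(cb))
```

The queue decouples the generation thread from the UI thread, which is what lets Gradio repaint the partial answer while the model is still producing tokens.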

[14]

Now we can define the Gradio UI blocks.
Since we need to define the UI and its handlers in the same place, this is a large chunk of code; we add comments in the code for explanation.

[15]

Lastly, we can launch this demo on our localhost with the command below.

[16]
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
 

Gradio defaults the launch port to 7860; you can select another port as needed.
Once launched, you should see the UI in the notebook or in a browser at http://0.0.0.0:7860.
Things to try in the chatbot demo:

  • Ask specific questions related to the Meta Llama 3 Getting Started guide
  • Adjust parameters such as the maximum number of new tokens generated
  • Switch to another Llama model with another container launched in a separate terminal

Once finished testing, make sure you close the demo by running the command below to release the port.

[ ]