
🏆🎬 RAG with Llama 3.1 and Haystack


Simple RAG example on the Oscars using Llama 3.1 open models and the Haystack LLM framework.

Installation

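The installation cell was not preserved in this export. A plausible setup, inferred from the components used below (exact package pins are an assumption; `transformers>=4.43` is needed for Llama 3.1 support):

```shell
pip install haystack-ai "transformers>=4.43" accelerate bitsandbytes sentence-transformers wikipedia
```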

Authorization

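The authorization cell prompted for a Hugging Face token. A minimal sketch, assuming the token is exposed via the `HF_API_TOKEN` environment variable (a convention Haystack's Hugging Face components read; your setup may differ, e.g. Colab secrets):

```python
import os
import sys
from getpass import getpass

# Prompt for the token only in an interactive session, and only if it is not
# already set; Haystack's Hugging Face components pick it up from HF_API_TOKEN.
if not os.environ.get("HF_API_TOKEN") and sys.stdin.isatty():
    os.environ["HF_API_TOKEN"] = getpass("Your Hugging Face token: ")
```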

RAG with Llama-3.1-8B-Instruct (about the Oscars) 🏆🎬


Load data from Wikipedia


Indexing Pipeline

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fcc409ea4d0>
🚅 Components
  - splitter: DocumentSplitter
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - splitter.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])
{'writer': {'documents_written': 12}}

RAG Pipeline


Here, we use the HuggingFaceLocalChatGenerator, loading the model in Colab with 4-bit quantization.

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.

Let's ask some questions!

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

This is a simple demo. We can improve the RAG pipeline in several ways, including better preprocessing of the input documents.

To use Llama 3 models in Haystack, you also have other options:

  • LlamaCppGenerator and OllamaGenerator: using the GGUF quantized format, these solutions are ideal for running LLMs on standard machines (even without GPUs).
  • HuggingFaceAPIChatGenerator, which allows you to query the Hugging Face API, a local TGI container, or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
  • vLLM via OpenAIChatGenerator: high-throughput and memory-efficient inference and serving engine for LLMs.

(Notebook by Stefano Fiorucci)