🏆🎬 RAG with Llama 3.1 and Haystack

Simple RAG example on the Oscars using Llama 3.1 open models and the Haystack LLM framework.
Installation
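The install cell was stripped from this export. A plausible setup for this notebook (the exact package list is an assumption) is:

```shell
pip install haystack-ai "transformers[torch]" sentence-transformers accelerate bitsandbytes wikipedia
```

`bitsandbytes` and `accelerate` are needed later for 4-bit quantized loading; `wikipedia` is used to fetch the source articles.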
Authorization
- you need a Hugging Face account
- you need to accept Meta's conditions here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and wait for the authorization to be granted
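The token-entry cell is missing from this export. A minimal sketch: read the token from the environment, prompting once if it is not set (`HF_API_TOKEN` is one of the environment variables Haystack's Hugging Face components look up by default):

```python
# Read the Hugging Face token, prompting for it once if it is not already set.
import os
from getpass import getpass

def get_hf_token() -> str:
    """Return the Hugging Face token, asking interactively if needed."""
    token = os.environ.get("HF_API_TOKEN")
    if not token:
        # getpass hides the token as you type it
        token = getpass("Your Hugging Face token: ")
        os.environ["HF_API_TOKEN"] = token
    return token
```

In Colab you could instead store the token in the Secrets panel and export it to `HF_API_TOKEN`.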
RAG with Llama-3.1-8B-Instruct (about the Oscars) 🏆🎬
Load data from Wikipedia
Indexing Pipeline
<haystack.core.pipeline.pipeline.Pipeline object at 0x7fcc409ea4d0>
🚅 Components
  - splitter: DocumentSplitter
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - splitter.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])
{'writer': {'documents_written': 12}}

RAG Pipeline
Here, we use the HuggingFaceLocalChatGenerator, loading the model in Colab with 4-bit quantization.
Let's ask some questions!
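The query cells were stripped. A sketch of running one question through the RAG pipeline; the component names (`text_embedder`, `prompt_builder`, `generator`) and the question itself are assumptions about how the pipeline was wired:

```python
question = "Who won the Best Picture award?"  # illustrative question

def ask(rag_pipeline, question: str) -> str:
    """Run the RAG pipeline for one question and return the generated answer."""
    result = rag_pipeline.run(
        {
            # the same question feeds both the retriever's embedder and the prompt
            "text_embedder": {"text": question},
            "prompt_builder": {"question": question},
        }
    )
    # the chat generator returns a list of ChatMessage replies
    return result["generator"]["replies"][0].text
```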
This is a simple demo. The RAG pipeline can be improved in several ways, including better preprocessing of the input data.
To use Llama 3 models in Haystack, you also have other options:
- LlamaCppGenerator and OllamaGenerator: using the GGUF quantized format, these solutions are ideal for running LLMs on standard machines (even without GPUs).
- HuggingFaceAPIChatGenerator, which allows you to query the Hugging Face API, a local TGI container, or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
- vLLM via OpenAIChatGenerator: a high-throughput and memory-efficient inference and serving engine for LLMs.
(Notebook by Stefano Fiorucci)