
🏆🎬 RAG with Llama 3.1 and Haystack


Simple RAG example on the Oscars using Llama 3.1 open models and the Haystack LLM framework.

Installation

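The installation cell was not preserved in this export. A plausible setup, inferred from the components used below (exact package pins are an assumption; `transformers>=4.43` is needed for Llama 3.1 support):

```shell
pip install haystack-ai "transformers>=4.43" accelerate bitsandbytes sentence-transformers wikipedia
```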

Authorization

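The authorization cell prompted for a Hugging Face token. A minimal sketch, assuming the token is exposed via the `HF_API_TOKEN` environment variable (a convention Haystack's Hugging Face components read; your setup may differ, e.g. Colab secrets):

```python
import os
import sys
from getpass import getpass

# Prompt for the token only in an interactive session, and only if it is not
# already set; Haystack's Hugging Face components pick it up from HF_API_TOKEN.
if not os.environ.get("HF_API_TOKEN") and sys.stdin.isatty():
    os.environ["HF_API_TOKEN"] = getpass("Your Hugging Face token: ")
```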

RAG with Llama-3.1-8B-Instruct (about the Oscars) 🏆🎬


Load data from Wikipedia


Indexing Pipeline

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fcc409ea4d0>
🚅 Components
  - splitter: DocumentSplitter
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - splitter.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])
{'writer': {'documents_written': 12}}

RAG Pipeline


Here, we use the HuggingFaceLocalChatGenerator, loading the model in Colab with 4-bit quantization.

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.

Let's ask some questions!

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

This is a simple demo. We can improve the RAG pipeline in several ways, including better preprocessing of the input documents.

To use Llama 3 models in Haystack, you also have other options:

  • LlamaCppGenerator and OllamaGenerator: using the GGUF quantized format, these solutions are ideal for running LLMs on standard machines (even without GPUs).
  • HuggingFaceAPIChatGenerator, which allows you to query the Hugging Face API, a local TGI container, or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
  • vLLM via OpenAIChatGenerator: high-throughput and memory-efficient inference and serving engine for LLMs.

(Notebook by Stefano Fiorucci)