
RAG pipelines with Haystack + Zephyr 7B Beta 🪁

Notebook by Stefano Fiorucci and Tuana Celik

We are going to build a Retrieval Augmented Generation (RAG) pipeline about Rock music, using the 🏗️ Haystack LLM orchestration framework and a good LLM: 💬 Zephyr 7B Beta, a fine-tuned version of Mistral-7B-v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks.

Install dependencies

  • wikipedia is needed to download data from Wikipedia
  • haystack-ai is the Haystack package
  • sentence_transformers is needed for embeddings
  • transformers is needed to use open-source LLMs
  • accelerate and bitsandbytes are required to use quantized versions of these models (with smaller memory footprint)
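The first (empty) cell installs these packages; a sketch of what it might contain, based on the list above (exact versions unpinned):

```shell
# Install the tutorial's dependencies (sentence_transformers is published as sentence-transformers)
pip install wikipedia haystack-ai sentence-transformers transformers accelerate bitsandbytes
```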
[ ]
[ ]

Load data from Wikipedia

We are going to download the Wikipedia pages related to some Rock bands, using the Python library wikipedia.

These pages are converted into Haystack Documents

[ ]
[ ]

The Indexing Pipeline

[ ]

We will save our final Documents in an InMemoryDocumentStore, a simple database which lives in memory.

[ ]

Our indexing Pipeline transforms the original Documents and saves them in the Document Store.

It consists of several components:

  • DocumentCleaner: performs a basic cleaning of the Documents
  • DocumentSplitter: chunks each Document into smaller pieces (more appropriate for semantic search and RAG)
  • SentenceTransformersDocumentEmbedder:
    • represents each Document as a vector (capturing its meaning).
    • we choose a good but not too large model from the MTEB leaderboard.
    • the title metadata field is also embedded, because it contains relevant information (metadata_fields_to_embed parameter).
    • we use the GPU for this expensive operation (device parameter).
  • DocumentWriter just saves the Documents in the Document Store
[ ]

Let's draw the indexing pipeline

[ ]
Output

We finally run the indexing pipeline

[ ]
{'writer': {'documents_written': 1554}}

Let's inspect the total number of chunked Documents and examine a Document

[ ]
1537
[ ]
{'title': 'Audioslave',
 'url': 'https://en.wikipedia.org/wiki/Audioslave',
 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}
[ ]
Document(id=3ca9785f81fb9fb0700f794b1fd2355626824599ecbce435e6f5e3babb05facc, content: 'Audioslave was an American rock supergroup formed in Glendale, California, in 2001. The four-piece b...', meta: {'title': 'Audioslave', 'url': 'https://en.wikipedia.org/wiki/Audioslave', 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}, embedding: vector of size 1024)
1024

The RAG Pipeline

HuggingFaceLocalGenerator with zephyr-7b-beta

  • To load and manage Open Source LLMs in Haystack, we can use the HuggingFaceLocalGenerator.

  • The LLM we choose is Zephyr 7B Beta, a fine-tuned version of Mistral-7B-v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks; the model was fine-tuned by the Hugging Face team.

  • Since we are using a free Colab instance (with limited resources), we load the model using 4-bit quantization (passing the appropriate huggingface_pipeline_kwargs to our Generator). For an introduction to Quantization in Hugging Face Transformers, you can read this simple blog post.

[ ]
[ ]

Let's try the model...

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(

Ok, nice!

PromptBuilder

It's a component that renders a prompt from a template string, using the Jinja2 engine.

Let's set up our prompt builder with a format like the following (appropriate for Zephyr):

"<|system|>\nSYSTEM MESSAGE</s>\n<|user|>\nUSER MESSAGE</s>\n<|assistant|>\n"

[ ]

Let's create the RAG pipeline

[ ]

Our RAG Pipeline finds Documents relevant to the user query and passes them to the LLM to generate a grounded answer.

It consists of several components:

  • SentenceTransformersTextEmbedder: represents the query as a vector (capturing its meaning). It must use the same model as the Document embedder.
  • InMemoryEmbeddingRetriever: finds the (top 5) Documents that are most similar to the query vector.
  • PromptBuilder
  • HuggingFaceLocalGenerator
[ ]

Visualize our pipeline!

[ ]
Output

We create a utility function that runs the RAG pipeline and nicely prints the answer.
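A sketch of such a helper, assuming the RAG pipeline above is available as rag_pipeline (the function name is illustrative):

```python
def get_generative_answer(query):
    """Run the RAG pipeline for a query and print the generated answer."""
    results = rag_pipeline.run({"text_embedder": {"text": query},
                                "prompt_builder": {"query": query}})
    answer = results["generator"]["replies"][0]
    print(answer)

# get_generative_answer("Who are the members of Green Day?")
```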

[ ]

Let's try our RAG pipeline...

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
[ ]

More questions to try...

[ ]
[ ]
Who are the members of Green Day?
[ ]
Was Ozzy Osbourne part of Blink 182?