
RAG pipelines with Haystack + Zephyr 7B Beta 🪁

Notebook by Stefano Fiorucci and Tuana Celik

We are going to build a Retrieval Augmented Generation (RAG) pipeline about Rock music, using the 🏗️ Haystack LLM orchestration framework and a good LLM: 💬 Zephyr 7B Beta, a fine-tuned version of Mistral-7B-v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks.

Install dependencies

  • wikipedia is needed to download data from Wikipedia
  • haystack-ai is the Haystack package
  • sentence_transformers is needed for embeddings
  • transformers is needed to use open-source LLMs
  • accelerate and bitsandbytes are required to use quantized versions of these models (with smaller memory footprint)
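The first (empty) cell installs these packages; a sketch of what it might contain, based on the list above (exact versions unpinned):

```shell
# Install the tutorial's dependencies (sentence_transformers is published as sentence-transformers)
pip install wikipedia haystack-ai sentence-transformers transformers accelerate bitsandbytes
```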
[ ]
[ ]

Load data from Wikipedia

We are going to download the Wikipedia pages related to some Rock bands, using the Python library wikipedia.

These pages are converted into Haystack Documents

[ ]
[ ]

The Indexing Pipeline

[ ]

We will save our final Documents in an InMemoryDocumentStore, a simple database which lives in memory.

[ ]

Our indexing Pipeline transforms the original Documents and saves them in the Document Store.

It consists of several components:

  • DocumentCleaner: performs a basic cleaning of the Documents
  • DocumentSplitter: chunks each Document into smaller pieces (more appropriate for semantic search and RAG)
  • SentenceTransformersDocumentEmbedder:
    • represents each Document as a vector (capturing its meaning).
    • we choose a good but not too large model from the MTEB leaderboard.
    • the title metadata field is also embedded, because it contains relevant information (metadata_fields_to_embed parameter).
    • we use the GPU for this expensive operation (device parameter).
  • DocumentWriter just saves the Documents in the Document Store
[ ]

Let's draw the indexing pipeline

[ ]
Output

We finally run the indexing pipeline

[ ]
{'writer': {'documents_written': 1554}}

Let's inspect the total number of chunked Documents and examine a Document

[ ]
1537
[ ]
{'title': 'Audioslave',
 'url': 'https://en.wikipedia.org/wiki/Audioslave',
 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}
[ ]
Document(id=3ca9785f81fb9fb0700f794b1fd2355626824599ecbce435e6f5e3babb05facc, content: 'Audioslave was an American rock supergroup formed in Glendale, California, in 2001. The four-piece b...', meta: {'title': 'Audioslave', 'url': 'https://en.wikipedia.org/wiki/Audioslave', 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}, embedding: vector of size 1024)
1024

The RAG Pipeline

HuggingFaceLocalGenerator with zephyr-7b-beta

  • To load and manage Open Source LLMs in Haystack, we can use the HuggingFaceLocalGenerator.

  • The LLM we choose is Zephyr 7B Beta, a fine-tuned version of Mistral-7B-v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks; the model was fine-tuned by the Hugging Face team.

  • Since we are using a free Colab instance (with limited resources), we load the model using 4-bit quantization (passing the appropriate huggingface_pipeline_kwargs to our Generator). For an introduction to Quantization in Hugging Face Transformers, you can read this simple blog post.

[ ]
[ ]

Let's try the model...

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(

Ok, nice!

PromptBuilder

It's a component that renders a prompt from a template string, using the Jinja2 engine.

Let's set up our prompt builder with a format like the following (appropriate for Zephyr):

"<|system|>\nSYSTEM MESSAGE</s>\n<|user|>\nUSER MESSAGE</s>\n<|assistant|>\n"

[ ]

Let's create the RAG pipeline

[ ]

Our RAG Pipeline finds Documents relevant to the user query and passes them to the LLM to generate a grounded answer.

It consists of several components:

  • SentenceTransformersTextEmbedder: represents the query as a vector (capturing its meaning). It must use the same model as the Document embedder.
  • InMemoryEmbeddingRetriever: finds the (top 5) Documents that are most similar to the query vector.
  • PromptBuilder
  • HuggingFaceLocalGenerator
[ ]

Visualize our pipeline!

[ ]
Output

We create a utility function that runs the RAG pipeline and nicely prints the answer.
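A sketch of such a helper, assuming the RAG pipeline above is available as rag_pipeline (the function name is illustrative):

```python
def get_generative_answer(query):
    """Run the RAG pipeline for a query and print the generated answer."""
    results = rag_pipeline.run({"text_embedder": {"text": query},
                                "prompt_builder": {"query": query}})
    answer = results["generator"]["replies"][0]
    print(answer)

# get_generative_answer("Who are the members of Green Day?")
```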

[ ]

Let's try our RAG pipeline...

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
[ ]

More questions to try...

[ ]
[ ]
Who are the members of Green Day?
[ ]
Was Ozzy Osbourne part of Blink 182?