RAG pipelines with Haystack + Zephyr 7B Beta 🪁
Notebook by Stefano Fiorucci and Tuana Celik
We are going to build a Retrieval-Augmented Generation pipeline about Rock music, using the 🏗️ Haystack LLM orchestration framework and a good LLM: 💬 Zephyr 7B Beta, a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks.
Install dependencies
- `wikipedia` is needed to download data from Wikipedia
- `haystack-ai` is the Haystack package
- `sentence-transformers` is needed for embeddings
- `transformers` is needed to use open-source LLMs
- `accelerate` and `bitsandbytes` are required to use quantized versions of these models (with a smaller memory footprint)
Load data from Wikipedia
We are going to download the Wikipedia pages related to some Rock bands, using the Python library `wikipedia`.
These pages are then converted into Haystack Documents.
The Indexing Pipeline
We will save our final Documents in an InMemoryDocumentStore, a simple database which lives in memory.
Our indexing Pipeline transforms the original Documents and saves them in the Document Store.
It consists of several components:
- `DocumentCleaner`: performs a basic cleaning of the Documents
- `DocumentSplitter`: chunks each Document into smaller pieces (more appropriate for semantic search and RAG)
- `SentenceTransformersDocumentEmbedder`: represents each Document as a vector (capturing its meaning). We choose a good but not too big model from the MTEB leaderboard. The `title` metadata field is also embedded, because it contains relevant information (`meta_fields_to_embed` parameter). We use the GPU for this expensive operation (`device` parameter).
- `DocumentWriter`: simply saves the Documents in the Document Store
Let's draw the indexing pipeline
We finally run the indexing pipeline
{'writer': {'documents_written': 1554}}

Let's inspect the total number of chunked Documents and examine a Document.
1537
{'title': 'Audioslave',
 'url': 'https://en.wikipedia.org/wiki/Audioslave',
 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}

Document(id=3ca9785f81fb9fb0700f794b1fd2355626824599ecbce435e6f5e3babb05facc, content: 'Audioslave was an American rock supergroup formed in Glendale, California, in 2001. The four-piece b...', meta: {'title': 'Audioslave', 'url': 'https://en.wikipedia.org/wiki/Audioslave', 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}, embedding: vector of size 1024)
1024
The RAG Pipeline
HuggingFaceLocalGenerator with zephyr-7b-beta
- To load and manage open-source LLMs in Haystack, we can use the `HuggingFaceLocalGenerator`.
- The LLM we choose is Zephyr 7B Beta, a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks; the model was fine-tuned by the Hugging Face team.
- Since we are using a free Colab instance (with limited resources), we load the model using 4-bit quantization (passing the appropriate `huggingface_pipeline_kwargs` to our Generator). For an introduction to quantization in Hugging Face Transformers, you can read this blog post.
Let's try the model...
Ok, nice!
PromptBuilder
It's a component that renders a prompt from a template string, using the Jinja2 templating engine.
Let's setup our prompt builder, with a format like the following (appropriate for Zephyr):
"<|system|>\nSYSTEM MESSAGE</s>\n<|user|>\nUSER MESSAGE</s>\n<|assistant|>\n"
Let's create the RAG pipeline
Our RAG Pipeline finds the Documents relevant to the user query and passes them to the LLM, which generates a grounded answer.
It consists of several components:
- `SentenceTransformersTextEmbedder`: represents the query as a vector (capturing its meaning)
- `InMemoryEmbeddingRetriever`: finds the (top 5) Documents that are most similar to the query vector
- `PromptBuilder`: renders the prompt from the retrieved Documents and the query
- `HuggingFaceLocalGenerator`: generates the answer
Visualize our pipeline!
We create a utility function that runs the RAG pipeline and nicely prints the answer.
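A sketch of such a helper (the component names `text_embedder`, `prompt_builder`, and `generator` are assumptions about how the pipeline's components were registered):

```python
def get_generative_answer(rag_pipeline, query: str) -> str:
    """Run the RAG pipeline for `query` and print the generated answer."""
    # the query feeds both the embedder (for retrieval) and the prompt builder
    results = rag_pipeline.run({
        "text_embedder": {"text": query},
        "prompt_builder": {"question": query},
    })
    answer = results["generator"]["replies"][0]
    print(answer)
    return answer
```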
Let's try our RAG pipeline...
More questions to try...
Who are the members of Green Day?
Was Ozzy Osbourne part of Blink 182?