🇮🇹🇬🇧 Multilingual RAG from a 🎧 podcast
Notebook by Stefano Fiorucci
This notebook shows how to create a multilingual Retrieval-Augmented Generation (RAG) application, starting from a podcast.
🧰 Stack:
- Haystack LLM framework
- OpenAI Whisper model for audio transcription
- Qdrant vector database
- multilingual embedding model: multilingual-e5-large
- multilingual LLM: Mistral Small
Installation
[ ]
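The contents of the install cell are not shown in this export. A plausible set of packages, assuming the current Haystack 2.x integration package names on PyPI:

```shell
pip install haystack-ai qdrant-haystack mistral-haystack pytube openai-whisper sentence-transformers
```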
Podcast transcription
- download the audio from YouTube using `pytube`
- transcribe it locally using Haystack's `LocalWhisperTranscriber` with the `whisper-small` model. We could use bigger models, which take longer to transcribe. We could also call the paid OpenAI API, using `RemoteWhisperTranscriber`.
Since the transcription takes some time (about 10 minutes), I commented out the following code and will provide the transcription.
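For reference, the commented-out cell looks roughly like this. This is a sketch, assuming Haystack 2.x component names and a placeholder video URL, not the exact cell contents:

```python
from pytube import YouTube
from haystack.components.audio import LocalWhisperTranscriber

# Download only the audio stream of the episode (the URL is a placeholder)
audio_path = (
    YouTube("https://www.youtube.com/watch?v=...")
    .streams.filter(only_audio=True)
    .first()
    .download()
)

# Transcribe locally with the whisper-small model (takes about 10 minutes)
transcriber = LocalWhisperTranscriber(model="small")
transcriber.warm_up()
result = transcriber.run(sources=[audio_path])
documents = result["documents"]
```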
[2]
[ ]
Indexing pipeline
Create an Indexing pipeline that stores chunks of the transcript in the Qdrant vector database.
- `TextFileToDocument` converts the transcript into a Haystack Document.
- `DocumentSplitter` divides the original Document into smaller chunks.
- `SentenceTransformersDocumentEmbedder` computes embeddings (= vector representations) of the Documents using a multilingual model, to allow semantic retrieval.
- `DocumentWriter` stores the Documents in Qdrant.
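To make the splitting step concrete, here is a stdlib-only sketch of what `DocumentSplitter`-style chunking does: break a long transcript into overlapping word windows so each chunk fits the embedder's input. The chunk size and overlap below are illustrative values, not the notebook's actual settings.

```python
def split_into_chunks(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into word chunks of `chunk_size`, each sharing `overlap` words with the previous one."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

transcript = "word " * 400  # stand-in for the real podcast transcript
chunks = split_into_chunks(transcript.strip())
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks, which helps retrieval quality.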
[4]
--2023-12-31 14:25:00--  https://raw.githubusercontent.com/anakin87/mistral-haystack/main/data/podcast_transcript_whisper_small.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61083 (60K) [text/plain]
Saving to: ‘podcast_transcript_whisper_small.txt’

podcast_transcript_ 100%[===================>]  59.65K  --.-KB/s    in 0.01s

2023-12-31 14:25:00 (5.46 MB/s) - ‘podcast_transcript_whisper_small.txt’ saved [61083/61083]

Ciao e benvenuti nella puntata 183 del Pointer Podcast, torniamo oggi con degli ospiti, ma prima vi introduto Eugenio, ciao Eugenio. Ciao Luca. Come va? Tutto bene? Tutto bene, tutto bene. Oggi abbiamo due ospiti che arrivano dalla stessa azienda, che è una azienda che produce una libreria che pro

(English: Hello and welcome to episode 183 of the Pointer Podcast, today we are back with guests, but first let me introduce Eugenio, hi Eugenio. Hi Luca. How is it going? All good? All good, all good. Today we have two guests who come from the same company, a company that makes a library that...)
[ ]
[ ]
[7]
(progress bars for downloading the multilingual-e5-large model files, including config, tokenizer, and about 2.24 GB of weights, omitted)
Batches: 0%| | 0/2 [00:00<?, ?it/s]
100it [00:00, 529.46it/s]
[8]
52
RAG pipeline
Finally our RAG pipeline: from an Italian podcast 🇮🇹🎧 to answering questions in English 🇬🇧
- `SentenceTransformersTextEmbedder` transforms the query into a vector that captures its semantics, to allow vector retrieval.
- `QdrantRetriever` compares the query and Document embeddings and fetches the Documents most relevant to the query.
- `ChatPromptBuilder` prepares the prompt for the LLM: it renders a prompt template and fills in variable values.
- `MistralChatGenerator` allows using Mistral LLMs. Read their Quickstart to get an API key.
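Conceptually, the embed, retrieve, and prompt-building steps amount to the following stdlib-only sketch. A toy bag-of-words similarity stands in for multilingual-e5-large, and the documents and prompt template are made up for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "the podcast discusses open source language models",
    "the hosts talk about their favorite text editors",
]

query = "open source models"
q_emb = embed(query)

# Retrieval: rank documents by similarity to the query embedding
ranked = sorted(documents, key=lambda d: cosine(q_emb, embed(d)), reverse=True)

# Prompt building: fill the retrieved context into a template for the LLM
prompt = f"Answer using the context.\nContext: {ranked[0]}\nQuestion: {query}"
```

Because the query and documents are compared in a shared embedding space, a real multilingual model lets an English query match Italian transcript chunks; the toy word-overlap version above only works within one language.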
[ ]
[ ]
[ ]
{'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Oh, absolutely! Let’s dive into why **Vim** is the *ultimate* IDE—because it’s not just an editor, it’s a *way of life*. Here’s the fun breakdown:\n\n### **1. Vim is Like a Ninja’s Sword—Lightning-Fast & Deadly**\n- **No mouse? No problem!** Your hands never leave the keyboard, making you code like a samurai slicing through bugs...')], _name=None, _meta={'model': 'mistral-small-latest', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 712, 'prompt_tokens': 15, 'total_tokens': 727, 'completion_tokens_details': None, 'prompt_tokens_details': None}})]}
[ ]
[ ]
[ ]
[ ]
Batches: 0%| | 0/1 [00:00<?, ?it/s]
(' The Pointer Podcast is an Italian language podcast available on Apple '
'Podcast, Google Podcast, and Spotify. It covers various topics, including '
'NLP (Natural Language Processing) and LLM (Large Language Model). The hosts '
'interview experts in the field, and listeners can contact the podcast via '
'email or social media.')
✨ Nice!
[ ]
Try our multilingual RAG application!
[17]
[18]
[19]
Will open source models achieve the quality of closed ones?
Batches: 0%| | 0/1 [00:00<?, ?it/s]
(' The documents suggest that open source models are reaching the performance '
'levels of closed models, and there is a paper from Aggin Face that '
'demonstrates a technique for distilling the characteristics of larger models '
'into smaller ones using human preference alignment with minimal human '
'supervision. This allows for the creation of smaller models that replicate '
'some of the capabilities of larger models, which can be useful for the end '
'user. The report and recipe for this technique have been released and can be '
'replicated with a relatively small budget. The documents also indicate that '
'open source models are having an important impact on the field, not just '
'from big companies like Open AI and Microsoft, but also from smaller '
'companies or individuals releasing open source models. However, there is a '
'tension between open source and proprietary models that is not likely to be '
'resolved soon. The quality and variety of open source models continue to '
'improve, despite the cost and difficulty of training and fine-tuning them.')