🇮🇹🇬🇧 Multilingual RAG from a 🎧 podcast
Notebook by Stefano Fiorucci
This notebook shows how to create a multilingual Retrieval-Augmented Generation (RAG) application, starting from a podcast.
🧰 Stack:
- Haystack LLM framework
- OpenAI Whisper model for audio transcription
- Qdrant vector database
- multilingual embedding model: multilingual-e5-large
- multilingual LLM: Mistral Small
Installation
[ ]
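The contents of the install cell are not shown in this export. A plausible set of packages, assuming the current Haystack 2.x integration package names on PyPI:

```shell
pip install haystack-ai qdrant-haystack mistral-haystack pytube openai-whisper sentence-transformers
```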
Podcast transcription
- download the audio from YouTube using `pytube`
- transcribe it locally using Haystack's `LocalWhisperTranscriber` with the `whisper-small` model. We could use bigger models, which take longer to transcribe. We could also call the paid OpenAI API, using `RemoteWhisperTranscriber`.
Since the transcription takes some time (about 10 minutes), I commented out the following code and will provide the transcription.
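For reference, the commented-out cell looks roughly like this. This is a sketch, assuming Haystack 2.x component names and a placeholder video URL, not the exact cell contents:

```python
from pytube import YouTube
from haystack.components.audio import LocalWhisperTranscriber

# Download only the audio stream of the episode (the URL is a placeholder)
audio_path = (
    YouTube("https://www.youtube.com/watch?v=...")
    .streams.filter(only_audio=True)
    .first()
    .download()
)

# Transcribe locally with the whisper-small model (takes about 10 minutes)
transcriber = LocalWhisperTranscriber(model="small")
transcriber.warm_up()
result = transcriber.run(sources=[audio_path])
documents = result["documents"]
```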
[2]
[ ]
Indexing pipeline
Create an Indexing pipeline that stores chunks of the transcript in the Qdrant vector database.
- `TextFileToDocument` converts the transcript into a Haystack Document.
- `DocumentSplitter` divides the original Document into smaller chunks.
- `SentenceTransformersDocumentEmbedder` computes embeddings (= vector representations) of the Documents using a multilingual model, to allow semantic retrieval.
- `DocumentWriter` stores the Documents in Qdrant.
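To make the splitting step concrete, here is a stdlib-only sketch of what `DocumentSplitter`-style chunking does: break a long transcript into overlapping word windows so each chunk fits the embedder's input. The chunk size and overlap below are illustrative values, not the notebook's actual settings.

```python
def split_into_chunks(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into word chunks of `chunk_size`, each sharing `overlap` words with the previous one."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

transcript = "word " * 400  # stand-in for the real podcast transcript
chunks = split_into_chunks(transcript.strip())
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks, which helps retrieval quality.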
[4]
--2023-12-31 14:25:00--  https://raw.githubusercontent.com/anakin87/mistral-haystack/main/data/podcast_transcript_whisper_small.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61083 (60K) [text/plain]
Saving to: ‘podcast_transcript_whisper_small.txt’

podcast_transcript_ 100%[===================>]  59.65K  --.-KB/s    in 0.01s

2023-12-31 14:25:00 (5.46 MB/s) - ‘podcast_transcript_whisper_small.txt’ saved [61083/61083]

Ciao e benvenuti nella puntata 183 del Pointer Podcast, torniamo oggi con degli ospiti, ma prima vi introduto Eugenio, ciao Eugenio. Ciao Luca. Come va? Tutto bene? Tutto bene, tutto bene. Oggi abbiamo due ospiti che arrivano dalla stessa azienda, che è una azienda che produce una libreria che pro

(English: Hello and welcome to episode 183 of the Pointer Podcast, today we are back with guests, but first let me introduce Eugenio, hi Eugenio. Hi Luca. How is it going? All good? All good, all good. Today we have two guests who come from the same company, a company that makes a library that...)
[ ]
[ ]
[7]
(progress bars for downloading the multilingual-e5-large model files, including config, tokenizer, and about 2.24 GB of weights, omitted)
Batches: 0%| | 0/2 [00:00<?, ?it/s]
100it [00:00, 529.46it/s]
[8]
52
RAG pipeline
Finally our RAG pipeline: from an Italian podcast 🇮🇹🎧 to answering questions in English 🇬🇧
- `SentenceTransformersTextEmbedder` transforms the query into a vector that captures its semantics, to allow vector retrieval.
- `QdrantRetriever` compares the query and Document embeddings and fetches the Documents most relevant to the query.
- `ChatPromptBuilder` prepares the prompt for the LLM: it renders a prompt template and fills in variable values.
- `MistralChatGenerator` allows using Mistral LLMs. Read their Quickstart to get an API key.
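Conceptually, the embed, retrieve, and prompt-building steps amount to the following stdlib-only sketch. A toy bag-of-words similarity stands in for multilingual-e5-large, and the documents and prompt template are made up for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "the podcast discusses open source language models",
    "the hosts talk about their favorite text editors",
]

query = "open source models"
q_emb = embed(query)

# Retrieval: rank documents by similarity to the query embedding
ranked = sorted(documents, key=lambda d: cosine(q_emb, embed(d)), reverse=True)

# Prompt building: fill the retrieved context into a template for the LLM
prompt = f"Answer using the context.\nContext: {ranked[0]}\nQuestion: {query}"
```

Because the query and documents are compared in a shared embedding space, a real multilingual model lets an English query match Italian transcript chunks; the toy word-overlap version above only works within one language.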
[ ]
[ ]
[ ]
{'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Oh, absolutely! Let’s dive into why **Vim** is the *ultimate* IDE—because it’s not just an editor, it’s a *way of life*. Here’s the fun breakdown:\n\n### **1. Vim is Like a Ninja’s Sword—Lightning-Fast & Deadly**\n- **No mouse? No problem!** Your hands never leave the keyboard, making you code like a samurai slicing through bugs...')], _name=None, _meta={'model': 'mistral-small-latest', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 712, 'prompt_tokens': 15, 'total_tokens': 727, 'completion_tokens_details': None, 'prompt_tokens_details': None}})]}
[ ]
[ ]
[ ]
[ ]
Batches: 0%| | 0/1 [00:00<?, ?it/s]
(' The Pointer Podcast is an Italian language podcast available on Apple '
'Podcast, Google Podcast, and Spotify. It covers various topics, including '
'NLP (Natural Language Processing) and LLM (Large Language Model). The hosts '
'interview experts in the field, and listeners can contact the podcast via '
'email or social media.')
✨ Nice!
[ ]
Try our multilingual RAG application!
[17]
[18]
[19]
Will open source models achieve the quality of closed ones?
Batches: 0%| | 0/1 [00:00<?, ?it/s]
(' The documents suggest that open source models are reaching the performance '
'levels of closed models, and there is a paper from Aggin Face that '
'demonstrates a technique for distilling the characteristics of larger models '
'into smaller ones using human preference alignment with minimal human '
'supervision. This allows for the creation of smaller models that replicate '
'some of the capabilities of larger models, which can be useful for the end '
'user. The report and recipe for this technique have been released and can be '
'replicated with a relatively small budget. The documents also indicate that '
'open source models are having an important impact on the field, not just '
'from big companies like Open AI and Microsoft, but also from smaller '
'companies or individuals releasing open source models. However, there is a '
'tension between open source and proprietary models that is not likely to be '
'resolved soon. The quality and variety of open source models continue to '
'improve, despite the cost and difficulty of training and fine-tuning them.')