Speaker Diarization with AssemblyAI
📚 This cookbook has an accompanying article with a complete walkthrough: "Level up Your RAG Application with Speaker Diarization"
LLMs excel at working with text and can answer complex questions about it without anyone having to read or search manually. To use them with audio or video, you first need a transcript. A transcript captures the spoken content, but for multi-speaker recordings it may miss non-verbal information, and a plain transcript does not convey how many speakers there are or who said what. To get the most out of an LLM on such recordings, speaker diarization is essential.
In this example, we'll build a RAG application with speaker labels for audio files. This application will use Haystack and speaker diarization models by AssemblyAI.
Install the Dependencies
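The install cell itself is not reproduced in this text. As a minimal sketch, assuming the packages implied by the components used in the rest of this cookbook (the package list and the lack of version pins are assumptions, not the authors' exact requirements), the dependencies could be installed from Python like this:

```python
import subprocess
import sys

# Assumed package set, inferred from the components this cookbook names.
PACKAGES = [
    "haystack-ai",            # Pipeline, DocumentSplitter, embedders, generators
    "assemblyai-haystack",    # AssemblyAITranscriber
    "sentence-transformers",  # local embedding models
    "huggingface_hub",        # client for the Hugging Face inference API
    "gdown",                  # Google Drive downloads
]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", *packages]


def install(packages=PACKAGES):
    """Install the packages; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(packages))
```

Calling `install()` once at the top of the notebook should be enough. In a notebook cell, `%pip install` with the same package names is an equivalent alternative.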
Download the Audio Files
We extracted the audio from YouTube videos and saved the files in a Google Drive folder for you: https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W?usp=drive_link
You can run the code below to download the audio files into this Colab notebook. They will appear under the "Files" tab in the left sidebar.
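A minimal sketch of the download step, assuming the `gdown` package is installed. The folder ID is taken from the Drive URL above; the helper names `download_audio_folder` and `list_audio_files` and the `/content` default are illustrative assumptions:

```python
from pathlib import Path

# Folder ID taken from the Google Drive URL above.
FOLDER_ID = "10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W"


def download_audio_folder(output_dir="/content", folder_id=FOLDER_ID):
    """Download every file in the shared Drive folder into output_dir."""
    # Third-party import kept local so list_audio_files works without gdown.
    import gdown

    return gdown.download_folder(id=folder_id, output=output_dir, quiet=False)


def list_audio_files(directory):
    """Return the names of the .mp3 files in directory, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("*.mp3"))
```

After `download_audio_folder()` finishes, `list_audio_files("/content")` is a quick way to check that the expected `.mp3` files arrived.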
Retrieving folder contents
Processing file 12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ Netflix_Q4_2023_Earnings_Interview.mp3
Processing file 1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD Panel_Discussion.mp3
Processing file 1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m- Working_From_Home_Debate.mp3
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ
To: /content/Netflix_Q4_2023_Earnings_Interview.mp3
100% 39.1M/39.1M [00:00<00:00, 67.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD
To: /content/Panel_Discussion.mp3
100% 21.8M/21.8M [00:00<00:00, 60.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m-
To: /content/Working_From_Home_Debate.mp3
100% 4.45M/4.45M [00:00<00:00, 34.8MB/s]
Download completed
Add Your API Keys
Enter the API keys from AssemblyAI and Hugging Face:
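One way to collect the keys without echoing them to the screen is sketched below. It uses only the standard library; the helper name `require_secret` is hypothetical:

```python
import os
from getpass import getpass


def require_secret(name, prompt=None, ask=getpass):
    """Return os.environ[name], prompting (without echo) only if it is unset."""
    if not os.environ.get(name):
        os.environ[name] = ask(prompt or f"Enter your {name}: ")
    return os.environ[name]
```

For example, `ASSEMBLYAI_API_KEY = require_secret("ASSEMBLYAI_API_KEY")` followed by `require_secret("HF_API_TOKEN")`. Storing the Hugging Face token in the `HF_API_TOKEN` environment variable assumes that the generator used later reads it from there by default.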
Enter your ASSEMBLYAI_API_KEY: ··········
HF_API_TOKEN: ··········
Index Speaker Labels to Your DocumentStore
Build a pipeline to generate speaker labels and index them into a DocumentStore with their embeddings. In this pipeline, you need:
- InMemoryDocumentStore: to store your documents without external dependencies or extra setup
- AssemblyAITranscriber: to create speaker_labels for the given audio file and convert them into Haystack Documents
- DocumentSplitter: to split your documents into smaller chunks
- SentenceTransformersDocumentEmbedder: to create embeddings for each document using sentence-transformers models
- DocumentWriter: to write these documents into your document store
Note: The speaker information is saved in the meta field of each Document object.
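The components listed above can be wired together roughly as follows. This is a sketch, not the notebook's original cell: the import paths and the `speaker_labels` output socket reflect the Haystack 2.x and `assemblyai-haystack` packages as I understand them, the component names `speaker_splitter` and `speaker_embedder` are assumptions, and the splitter settings are illustrative:

```python
# Wiring of the indexing pipeline: (sender, receiver) pairs, in order.
# "transcriber.speaker_labels" selects the transcriber's per-speaker output.
INDEXING_CONNECTIONS = [
    ("transcriber.speaker_labels", "speaker_splitter"),
    ("speaker_splitter", "speaker_embedder"),
    ("speaker_embedder", "speaker_writer"),
]


def build_indexing_pipeline(assemblyai_api_key):
    """Return (pipeline, document_store) for transcribing and indexing audio."""
    # Third-party imports are kept local so this module loads without them.
    from assemblyai_haystack.transcriber import AssemblyAITranscriber
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersDocumentEmbedder
    from haystack.components.preprocessors import DocumentSplitter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    document_store = InMemoryDocumentStore()
    pipeline = Pipeline()
    pipeline.add_component(
        "transcriber", AssemblyAITranscriber(api_key=assemblyai_api_key)
    )
    # Splitter settings are illustrative assumptions, not the cookbook's values.
    pipeline.add_component(
        "speaker_splitter",
        DocumentSplitter(split_by="sentence", split_length=10, split_overlap=1),
    )
    # Uses the embedder's default sentence-transformers model.
    pipeline.add_component("speaker_embedder", SentenceTransformersDocumentEmbedder())
    pipeline.add_component("speaker_writer", DocumentWriter(document_store))
    for sender, receiver in INDEXING_CONNECTIONS:
        pipeline.connect(sender, receiver)
    return pipeline, document_store
```

Returning the document store alongside the pipeline keeps it available for the retrieval step later on.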
Give an audio_file_path and run your pipeline
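A sketch of the run step, assuming the pipeline's transcriber component is registered under the name `transcriber` and accepts `file_path` and `speaker_labels` as run-time inputs. The helper names are hypothetical:

```python
def transcriber_inputs(audio_file_path):
    """Run-time inputs for the 'transcriber' component (assumed component name)."""
    return {
        "transcriber": {
            "file_path": audio_file_path,
            "speaker_labels": True,  # request speaker-labelled Documents
        }
    }


def index_audio(indexing_pipeline, audio_file_path):
    """Transcribe audio_file_path and push the result through the pipeline."""
    return indexing_pipeline.run(transcriber_inputs(audio_file_path))
```

For example, `index_audio(indexing_pipeline, "/content/Panel_Discussion.mp3")` would index one of the files downloaded earlier, where `indexing_pipeline` is whatever name you gave your pipeline object.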
{'transcriber': {'transcription': [Document(id=427e56c68f0440dd8f51643ba52e2a2b60c739f4fc42ddab7207fb428da4492d, content: 'I want to start with you, Amy, because I know you, obviously at Shell have had AI as part of the wor...', meta: {'transcript_id': 'c053a806-6826-40ac-a6bc-95cab9b4cb8a', 'audio_url': 'https://cdn.assemblyai.com/upload/188cdd14-ff33-4468-81cb-e2c337674fc5'})]}, 'speaker_writer': {'documents_written': 64}}
RAG Pipeline with Speaker Labels
Build a RAG pipeline to generate answers to questions about the recording. Ensure that speaker information (provided through the metadata of the document) is included in the prompt for the LLM to distinguish who said what. For this pipeline, you need:
- SentenceTransformersTextEmbedder: to create an embedding for the user query using sentence-transformers models
- InMemoryEmbeddingRetriever: to retrieve the top_k documents most relevant to the user query
- PromptBuilder: to provide a RAG prompt template with instructions to be filled with retrieved documents and the user query
- HuggingFaceAPIGenerator: to run inference on models served through the free Hugging Face Serverless Inference API or Hugging Face TGI
The LLM in the example (mistralai/Mixtral-8x7B-Instruct-v0.1) is a gated model. Make sure you have access to the model.
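The sketch below shows one way to assemble these components. The prompt wording, the component names (`text_embedder`, `retriever`, `prompt_builder`, `llm`), and `max_new_tokens` are assumptions rather than the notebook's original values. The essential part is the template loop, which prefixes each retrieved chunk with the speaker label read from the Document's meta:

```python
# Jinja2 template: each retrieved chunk is prefixed with its speaker label,
# read from the Document's meta, so the LLM can tell who said what.
RAG_PROMPT = """You will be given excerpts from a transcript. Each excerpt is
attributed to a speaker by the word "Speaker" followed by a letter.
Answer the question using only these excerpts. If they are not enough
to answer it, say so.

Transcript:
{% for doc in documents %}
{% if doc.meta["speaker"] %}Speaker {{ doc.meta["speaker"] }}: {% endif %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""


def build_rag_pipeline(document_store, model="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Return a query pipeline over document_store (component names are assumptions)."""
    # Third-party imports are kept local so this module loads without them.
    from haystack import Pipeline
    from haystack.components.builders import PromptBuilder
    from haystack.components.embedders import SentenceTransformersTextEmbedder
    from haystack.components.generators import HuggingFaceAPIGenerator
    from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

    rag = Pipeline()
    # The query embedder must use the same model as the document embedder;
    # here both are assumed to use the library default.
    rag.add_component("text_embedder", SentenceTransformersTextEmbedder())
    rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store))
    rag.add_component("prompt_builder", PromptBuilder(template=RAG_PROMPT))
    rag.add_component(
        "llm",
        HuggingFaceAPIGenerator(
            api_type="serverless_inference_api",
            api_params={"model": model},
            generation_kwargs={"max_new_tokens": 500},
        ),
    )
    rag.connect("text_embedder.embedding", "retriever.query_embedding")
    rag.connect("retriever.documents", "prompt_builder.documents")
    rag.connect("prompt_builder.prompt", "llm.prompt")
    return rag
```

The `{% if %}` guard keeps the prompt readable for chunks that carry no speaker label.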
Test RAG with Speaker Labels
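A minimal sketch of querying the pipeline, assuming its components are registered as `text_embedder`, `retriever`, `prompt_builder`, and `llm`. The helper name `ask` and the `top_k` default are hypothetical:

```python
def ask(rag_pipeline, question, top_k=10):
    """Run question through the RAG pipeline and return the first LLM reply."""
    result = rag_pipeline.run(
        {
            "text_embedder": {"text": question},       # embed the query
            "retriever": {"top_k": top_k},             # number of chunks to fetch
            "prompt_builder": {"question": question},  # fill the prompt template
        }
    )
    return result["llm"]["replies"][0]
```

The question is passed twice because the embedder needs it for retrieval and the prompt builder needs it for the final prompt. A question about each speaker's view on building in-house versus using third parties would be a natural test for the panel discussion recording.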
" Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker B believes that the decision depends on whether the task is part of the company's core IP or not. They also mention that the build versus buy decision is too simplistic, as there are other options like partnering or using third-party platforms. Speaker C takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker B thinks that AI is not magic and requires hard work, process, and change management, just like any other business process."