deepset Using Speaker Diarization With Assemblyai

Speaker Diarization with AssemblyAI

📚 This cookbook has an accompanying article with a complete walkthrough: "Level up Your RAG Application with Speaker Diarization".

LLMs excel with text data: they can answer complex questions without anyone having to read or search the text manually. To use them with audio or video, you first need a transcription. A plain transcription captures what was said, but in multi-speaker recordings it does not convey how many speakers there are or who said what, and it may miss non-verbal information. To get the most out of an LLM on such recordings, you need speaker diarization, which attributes each part of the transcript to a speaker.

In this example, we'll build a RAG application with speaker labels for audio files. This application will use Haystack and speaker diarization models by AssemblyAI.

📚 Useful Sources:

Integration: AssemblyAI

Install the Dependencies

[ ]
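The install cell is empty in this export. As a minimal sketch, the helper below checks which packages are still missing and builds the matching `pip` command. The package names (`assemblyai-haystack`, `sentence-transformers`, `huggingface_hub`, `gdown`) are assumptions inferred from the components and outputs used in this cookbook, not a confirmed requirements list.

```python
import importlib.util

# Assumed requirements: pip distribution name -> importable module name.
# These names are inferred from the components used in this cookbook.
REQUIRED_PACKAGES = {
    "assemblyai-haystack": "assemblyai_haystack",
    "sentence-transformers": "sentence_transformers",
    "huggingface_hub": "huggingface_hub",
    "gdown": "gdown",
}


def missing_packages(required=REQUIRED_PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


def pip_install_command(packages):
    """Build the pip command line that would install the given packages."""
    return ("pip install " + " ".join(packages)) if packages else ""


# Print the command to run; it is empty when nothing is missing.
print(pip_install_command(missing_packages()))
```

In a Colab cell, the printed command can be run by prefixing it with `!`.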

Download The Audio Files

We extracted the audio from YouTube videos and saved it in a Google Drive folder for you: https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W?usp=drive_link

Run the code below to download the audio files into this Colab notebook. They will appear under the "Files" tab in the left sidebar.

[3]
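The download cell's code is not included in this export. One way to produce output like the log below is `gdown`'s Python API, sketched here. The use of `gdown` and the `/content` output directory are assumptions based on the log lines, and the function name is hypothetical.

```python
# Folder shared above, without the tracking query string.
FOLDER_URL = "https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W"


def download_audio_files(output_dir="/content", folder_url=FOLDER_URL):
    """Download every file in the shared Drive folder into output_dir."""
    # Imported lazily so this cell can be defined before gdown is installed.
    import gdown

    # download_folder returns the list of local file paths it wrote.
    return gdown.download_folder(url=folder_url, output=output_dir, quiet=False)


# Uncomment to download (requires network access and the gdown package):
# downloaded = download_audio_files()
```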

Retrieving folder contents
Processing file 12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ Netflix_Q4_2023_Earnings_Interview.mp3
Processing file 1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD Panel_Discussion.mp3
Processing file 1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m- Working_From_Home_Debate.mp3
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ
To: /content/Netflix_Q4_2023_Earnings_Interview.mp3
100% 39.1M/39.1M [00:00<00:00, 67.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD
To: /content/Panel_Discussion.mp3
100% 21.8M/21.8M [00:00<00:00, 60.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m-
To: /content/Working_From_Home_Debate.mp3
100% 4.45M/4.45M [00:00<00:00, 34.8MB/s]
Download completed

Add Your API Keys

Enter the API keys from AssemblyAI and Hugging Face:

[4]
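The key-entry cell is also missing here. A minimal sketch, using only the standard library, that prompts for a secret without echoing it and caches it in the environment. The helper name `read_secret` is hypothetical; the variable names match the prompts shown below.

```python
import os
from getpass import getpass


def read_secret(name, prompt=None):
    """Return the secret from the environment, prompting only if it is unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed characters, as in the masked output below.
        value = getpass(prompt or f"{name}: ")
        os.environ[name] = value
    return value


# Uncomment in the notebook to be prompted for both keys:
# ASSEMBLYAI_API_KEY = read_secret("ASSEMBLYAI_API_KEY", "Enter your ASSEMBLYAI_API_KEY: ")
# read_secret("HF_API_TOKEN")
```

Storing `HF_API_TOKEN` in the environment lets Hugging Face clients that read that variable pick it up without passing the token around explicitly.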

Enter your ASSEMBLYAI_API_KEY: ··········
HF_API_TOKEN: ··········

Index Speaker Labels to Your DocumentStore

Build a pipeline to generate speaker labels and index them into a DocumentStore with their embeddings. In this pipeline, you need:

InMemoryDocumentStore: to store your documents without external dependencies or extra setup
AssemblyAITranscriber: to create speaker_labels for the given audio file and convert them into Haystack Documents
DocumentSplitter: to split your documents into smaller chunks
SentenceTransformersDocumentEmbedder: to create embeddings for each document using sentence-transformers models
DocumentWriter: to write these documents into your document store

Note: The speaker information will be saved in the meta of the Document object

[ ]
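The cell that builds this pipeline is empty in the export. The sketch below wires the five components listed above. The component names `transcriber` and `speaker_writer` come from the run output shown further down; the other names, the splitter settings, the `assemblyai_haystack.transcriber` import path, and the `speaker_labels` output socket are assumptions to verify against the installed versions.

```python
# Assumed chunking settings: 10 sentences per chunk with 1 sentence of overlap.
SPLITTER_SETTINGS = {"split_by": "sentence", "split_length": 10, "split_overlap": 1}


def build_indexing_pipeline(assemblyai_api_key, document_store=None):
    """Return (pipeline, document_store) for indexing speaker-labelled audio."""
    # Lazy imports: these need the haystack-ai and assemblyai-haystack packages.
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersDocumentEmbedder
    from haystack.components.preprocessors import DocumentSplitter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack.document_stores.types import DuplicatePolicy
    from assemblyai_haystack.transcriber import AssemblyAITranscriber

    if document_store is None:
        document_store = InMemoryDocumentStore()

    pipeline = Pipeline()
    pipeline.add_component("transcriber", AssemblyAITranscriber(api_key=assemblyai_api_key))
    pipeline.add_component("speaker_splitter", DocumentSplitter(**SPLITTER_SETTINGS))
    pipeline.add_component("speaker_embedder", SentenceTransformersDocumentEmbedder())
    pipeline.add_component(
        "speaker_writer", DocumentWriter(document_store, policy=DuplicatePolicy.SKIP)
    )

    # Only the speaker-labelled documents are split, embedded, and stored.
    pipeline.connect("transcriber.speaker_labels", "speaker_splitter")
    pipeline.connect("speaker_splitter", "speaker_embedder")
    pipeline.connect("speaker_embedder", "speaker_writer")
    return pipeline, document_store


# Example (requires the packages above and a valid key):
# indexing_pipeline, speaker_document_store = build_indexing_pipeline(ASSEMBLYAI_API_KEY)
```

Returning the document store alongside the pipeline keeps it available for the retrieval step later.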

Set audio_file_path to one of the downloaded files and run your pipeline:

[6]
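The code for this cell is not shown. A minimal sketch of the run call, assuming an indexing pipeline whose transcriber component is named `transcriber` (as in the output below) and accepts `file_path` and `speaker_labels` run parameters. The helper names are hypothetical.

```python
def indexing_inputs(audio_file_path):
    """Build the run input that asks the transcriber for speaker labels."""
    return {
        "transcriber": {
            "file_path": audio_file_path,
            "speaker_labels": True,  # enable speaker diarization
        }
    }


def run_indexing(indexing_pipeline, audio_file_path):
    """Transcribe the file and write the speaker-labelled chunks to the store."""
    return indexing_pipeline.run(indexing_inputs(audio_file_path))


# Example, using one of the files downloaded earlier:
# run_indexing(indexing_pipeline, "/content/Panel_Discussion.mp3")
```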

/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py:92: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v3 of SentenceTransformers.
  warnings.warn(

{'transcriber': {'transcription': [Document(id=427e56c68f0440dd8f51643ba52e2a2b60c739f4fc42ddab7207fb428da4492d, content: 'I want to start with you, Amy, because I know you, obviously at Shell have had AI as part of the wor...', meta: {'transcript_id': 'c053a806-6826-40ac-a6bc-95cab9b4cb8a', 'audio_url': 'https://cdn.assemblyai.com/upload/188cdd14-ff33-4468-81cb-e2c337674fc5'})]},
 'speaker_writer': {'documents_written': 64}}

RAG Pipeline with Speaker Labels

Build a RAG pipeline to generate answers to questions about the recording. Ensure that speaker information (provided through the metadata of the document) is included in the prompt for the LLM to distinguish who said what. For this pipeline, you need:

SentenceTransformersTextEmbedder: to create an embedding for the user query using sentence-transformers models
InMemoryEmbeddingRetriever: to retrieve the top_k documents most relevant to the user query
PromptBuilder: to provide a RAG prompt template with instructions, to be filled with the retrieved documents and the user query
HuggingFaceAPIGenerator: to query models served through the free Hugging Face Serverless Inference API or Hugging Face TGI

The LLM in the example (mistralai/Mixtral-8x7B-Instruct-v0.1) is a gated model. Make sure you have access to the model.

[ ]
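The RAG pipeline cell is empty in this export. The sketch below shows one way to assemble the four components listed above so that each retrieved chunk is prefixed with its speaker. The prompt wording, the component names, `top_k=3`, `max_new_tokens=500`, and the `speaker` metadata key are assumptions, not the notebook's original values.

```python
# Jinja template: each retrieved chunk is prefixed with "Speaker <letter>:"
# when the document metadata carries a speaker label (assumed key: "speaker").
PROMPT_TEMPLATE = """
You will be provided with a transcription of a recording. Each passage is
attributed to a speaker by the word "Speaker" followed by a letter.
Answer the question based only on the transcription. If the transcription
is not enough to answer the question, say so.

Transcription:
{% for doc in documents %}
{% if doc.meta["speaker"] %}Speaker {{ doc.meta["speaker"] }}: {% endif %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:
"""


def build_rag_pipeline(document_store, model="mistralai/Mixtral-8x7B-Instruct-v0.1", top_k=3):
    """Return a query pipeline over a store of speaker-labelled chunks."""
    # Lazy imports: these need the haystack-ai package.
    from haystack import Pipeline
    from haystack.components.builders import PromptBuilder
    from haystack.components.embedders import SentenceTransformersTextEmbedder
    from haystack.components.generators import HuggingFaceAPIGenerator
    from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

    pipeline = Pipeline()
    pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
    pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=top_k))
    pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
    pipeline.add_component(
        "llm",
        HuggingFaceAPIGenerator(
            api_type="serverless_inference_api",
            api_params={"model": model},
            generation_kwargs={"max_new_tokens": 500},
        ),
    )

    pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
    pipeline.connect("retriever.documents", "prompt_builder.documents")
    pipeline.connect("prompt_builder", "llm")
    return pipeline
```

The query embedder should use the same sentence-transformers model as the document embedder used at indexing time; otherwise the query and document embeddings are not comparable.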

Test RAG with Speaker Labels

[26]
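The query cell's code is missing from this export. As a hedged sketch, the same question is passed to both the query embedder and the prompt builder. The component names `text_embedder`, `prompt_builder`, and `llm` are assumptions, and the sample question is illustrative, chosen only to match the topic of the answer shown below.

```python
def rag_inputs(question):
    """Route the question to the query embedder and to the prompt template."""
    return {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }


def ask(rag_pipeline, question, generator_name="llm"):
    """Run the pipeline and return the first generated reply."""
    result = rag_pipeline.run(rag_inputs(question))
    return result[generator_name]["replies"][0]


# Example (illustrative question, not taken from the original notebook):
# ask(rag_pipeline, "What do the speakers think about building in-house versus using third parties?")
```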

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

" Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker B believes that the decision depends on whether the task is part of the company's core IP or not. They also mention that the build versus buy decision is too simplistic, as there are other options like partnering or using third-party platforms. Speaker C takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker B thinks that AI is not magic and requires hard work, process, and change management, just like any other business process."