YouTube Transcript Search QA Bot
This Q&A bot lets you search through YouTube transcripts using natural language. As we work through this notebook, we'll show how you can use LanceDB to store and manage your data easily.
Install dependencies
Download the data
For this application we are going to use the Hugging Face dataset jamescalam/youtube-transcriptions.
For more information, see the dataset's page on the Hugging Face Hub.
We'll use the training split, which contains 700 videos and 208619 sentences.
Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})
Prepare context
Each item in the dataset contains just a short chunk of text, so we'll need to merge these chunks together on a rolling basis. For this demo, each context merges 20 consecutive rows, and the window advances 4 rows at a time, so neighboring contexts overlap.
LanceDB offers chaining support so you can write declarative, readable, and parameterized queries. Here we serialize to pandas as well:
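To make the rolling merge concrete, here is a minimal pure-Python sketch of the idea. It is not the notebook's own cell: the function name `merge_rows`, the choice to group by `video_id`, and the handling of short windows at the end of a video are all assumptions made for illustration. The field names (`video_id`, `text`, `start`, `end`) come from the dataset features listed above.

```python
from itertools import groupby
from operator import itemgetter


def merge_rows(rows, window=20, stride=4):
    """Merge consecutive transcript rows into overlapping contexts.

    `rows` is a list of dicts with "video_id", "text", "start" and "end".
    Rows are assumed to be ordered by video and then by time, because
    groupby only groups adjacent items.
    """
    contexts = []
    for video_id, group in groupby(rows, key=itemgetter("video_id")):
        group = list(group)
        # Start a new window every `stride` rows; windows never cross videos.
        for i in range(0, len(group), stride):
            chunk = group[i : i + window]  # windows near the end are shorter
            contexts.append(
                {
                    "video_id": video_id,
                    "start": chunk[0]["start"],
                    "end": chunk[-1]["end"],
                    "text": " ".join(row["text"] for row in chunk),
                }
            )
    return contexts
```

With `window=20` and `stride=4`, each context shares 16 rows with the one before it, so a sentence that straddles a chunk boundary still appears intact in at least one context.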
Create embedding function
To create embeddings from the text corpus, we'll call the OpenAI embeddings API. You can open a free OpenAI account to get credits.
We'll use the ada-002 text embedding model (text-embedding-ada-002).
Create the LanceDB Table
The OpenAI API can fail or time out, so LanceDB's API provides retry and throttling features behind the scenes to make it easier to call these APIs.
In LanceDB, the primary abstraction you'll use to work with your data is a Table. A Table is designed to store large numbers of columns and huge quantities of data. For those interested, LanceDB is columnar and uses Lance, an open data format, to store data.
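To illustrate what a retry layer around a flaky API call looks like, here is a generic sketch using exponential backoff. It is not LanceDB's implementation, and `call_with_retry` with its parameters is a hypothetical helper written for this example. The `sleep` argument is injectable only so the behavior can be checked without actually waiting.

```python
import time


def call_with_retry(func, *args, max_retries=5, base_delay=1.0,
                    sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    The wait doubles after each failed attempt: base_delay, 2 * base_delay,
    4 * base_delay, and so on. The last error is re-raised once
    max_retries attempts have been used.
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In practice you would usually catch only the exception types that signal a transient failure (timeouts, rate limits) rather than every `Exception`, so that genuine bugs fail fast.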
DeprecationWarning: Function with_embeddings is deprecated and will be removed in a future version
  data = with_embeddings(embed_func, df, show_progress=True)
Connect to the LanceDB
For more information on the LanceDB interface, head over to the LanceDB docs.
52250
The table is backed by a Lance dataset, so it's easy to integrate with other tools (e.g., pandas).
Create and answer the prompt
For a given context (a chunk of transcript text), we can ask the OpenAI Completion API to answer an arbitrary question using the following prompt:
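As a rough sketch of how such a prompt can be assembled, the helper below joins retrieved contexts into a single string and stops adding them once a character budget is reached. The function name `create_prompt`, the instruction wording, the separator, and the `max_chars` default are assumptions made for illustration, not the notebook's exact prompt.

```python
def create_prompt(question, contexts, max_chars=3750):
    """Build a question-answering prompt from a list of context strings.

    Contexts are added in order until the next one would push the combined
    context text past max_chars; the rest are dropped.
    """
    header = "Answer the question based on the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    separator = "\n\n---\n\n"

    kept = []
    used = 0
    for text in contexts:
        # Every context after the first also pays for a separator.
        cost = len(text) + (len(separator) if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(text)
        used += cost

    return header + separator.join(kept) + footer
```

Because contexts are added in the order given, passing them sorted by relevance means the least relevant ones are the first to be dropped when the budget runs out.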
'The 12th person on the moon was Eugene Cernan, who landed on December 11, 1972 as part of the Apollo 17 mission.'
Let's put it all together now
Again we'll use LanceDB's chaining query API. This time, we'll perform a similarity search to find the stored embeddings most similar to the embedding of our query. We can easily tweak the parameters in the query to produce the best result.
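Conceptually, a similarity search scores every stored embedding against the query embedding and returns the closest rows. The brute-force sketch below shows that idea in plain Python. It is not LanceDB's API or implementation, the helper names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is chosen here only for illustration, since the metric used by the notebook's query is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, rows, k=3):
    """Return the k rows whose "vector" field is most similar to query_vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return ranked[:k]
```

The `text` fields of the returned rows are what would be passed as context to the prompt, and `k` is one of the parameters worth tuning: a larger value gives the model more material but uses more of the prompt budget.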
'You can use NLI with multiple negative ranking loss or STS with cosine similarity loss.'