Voyage X LanceDB

Vector Search with LanceDB using Voyage AI's Text and Multimodal Embeddings

Voyage AI embeddings provide high-precision, domain-specific vector representations of text and multimodal data, while LanceDB offers a fast, persistent vector database for efficient storage and retrieval. Together, they form a solid foundation for semantic search, letting developers build context-aware applications with little computational overhead.

This notebook walks through a semantic search example that uses Voyage AI's text and multimodal embeddings to build a search system over both text and images, so a single query can search image and text data at the same time.

Install Dependencies
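The install cell itself is not preserved in this export, so the exact package list is unknown. As a minimal sketch, the check below assumes the notebook needs `lancedb`, `voyageai`, `datasets`, and `pandas` (an assumed list, not taken from the original cell) and reports which of them are missing from the current environment.

```python
import importlib.util

# Assumed package list: the original install cell is not shown in this export.
REQUIRED = ["lancedb", "voyageai", "datasets", "pandas"]


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        # Print a pip command for whatever is missing instead of installing silently.
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All assumed dependencies are available.")
```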

[ ]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.

Load dataset

This example uses the Conceptual Captions dataset from Google Research.

About Dataset

Dataset Summary

Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

We'll load the validation split of this dataset, which contains about 15k records. You can also try the train split; just make sure you filter the images first and confirm that every image URL is reachable. Filtering for reachable image URLs is a time-consuming step.
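The URL filtering step mentioned above could look roughly like the sketch below. It is not the notebook's own code: the helper names `url_is_reachable` and `filter_reachable` are hypothetical, and the dataset ID in the commented usage is an assumption. It issues HEAD requests in parallel and keeps only rows whose `image_url` responds.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def url_is_reachable(url, timeout=5.0):
    """Return True if a HEAD request to `url` succeeds with a 2xx/3xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except Exception:
        # Malformed URLs, timeouts, DNS failures and HTTP errors all count as unreachable.
        return False


def filter_reachable(rows, check=url_is_reachable, max_workers=16):
    """Keep only rows whose 'image_url' passes `check`, preserving input order."""
    rows = list(rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(check, (row["image_url"] for row in rows)))
    return [row for row, ok in zip(rows, flags) if ok]


# Possible usage (requires network access; the dataset ID is an assumption):
# from datasets import load_dataset
# ds = load_dataset("google-research-datasets/conceptual_captions", split="validation")
# rows = filter_reachable(ds.to_pandas().to_dict("records"))
```

Some servers reject HEAD requests, so a stricter version might fall back to a small GET before discarding a URL.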

[ ]

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(

README.md:   0%|          | 0.00/14.2k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/187M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/187M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.77M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3318333 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/15840 [00:00<?, ? examples/s]

[ ]

Index(['image_url', 'caption'], dtype='object')

Set the Voyage API key as an environment variable

Add your Voyage API key as a secret in Google Colab. If you don't have one, you can sign up for one here (with 200M free tokens): https://dash.voyageai.com
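One way to do this is sketched below. It assumes the Colab secret is named `VOYAGE_API_KEY` (the secret name is an assumption; the original cell is not shown) and falls back to an already-set environment variable when run outside Colab. The helper name `set_voyage_api_key` is hypothetical.

```python
import os


def set_voyage_api_key(key=None, env_var="VOYAGE_API_KEY"):
    """Export the Voyage API key to the environment and return it."""
    if key is None:
        try:
            # Colab-only import; userdata.get reads a secret saved in the Colab sidebar.
            from google.colab import userdata
            key = userdata.get(env_var)
        except Exception:
            # Not on Colab, or the secret is missing: fall back to the environment.
            key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No Voyage API key found; set {env_var} first.")
    os.environ[env_var] = key
    return key
```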

[ ]

Voyage's Text model

[ ]

Requirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (17.0.0)
Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.10/dist-packages (from pyarrow) (1.26.4)

[ ]

Create a LanceDB table to index the data for querying

This step may take some time, because the data is ingested into LanceDB in batches.
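A minimal sketch of batched ingestion follows. The helper names, the table name, the database path, and the batch size are all hypothetical; the original cell is not shown. `create_caption_table` assumes the `voyageai` entry in LanceDB's embedding registry and needs both `lancedb` and a valid `VOYAGE_API_KEY` at call time, which is why its imports are deferred.

```python
def batches(records, size):
    """Yield consecutive slices of `records` with at most `size` items each."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def ingest(table, records, batch_size=100):
    """Add `records` (a list of dicts) to `table` batch by batch; return the batch count."""
    count = 0
    for batch in batches(records, batch_size):
        table.add(batch)  # with an embedding schema, LanceDB embeds the source column here
        count += 1
    return count


def create_caption_table(uri="/tmp/lancedb", name="captions"):
    """Create an empty table whose 'caption' column is embedded with voyage-3."""
    # Deferred imports: only needed when a table is actually created.
    import lancedb
    from lancedb.embeddings import get_registry
    from lancedb.pydantic import LanceModel, Vector

    voyage = get_registry().get("voyageai").create(name="voyage-3")

    class Caption(LanceModel):
        image_url: str
        caption: str = voyage.SourceField()  # text to embed
        vector: Vector(voyage.ndims()) = voyage.VectorField()  # filled in on insert

    db = lancedb.connect(uri)
    return db.create_table(name, schema=Caption, mode="overwrite")
```

Smaller batches keep each embedding request short and make it easier to resume after a failed call, at the cost of more round trips.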

[ ]

Query the indexed data with Voyage's voyage-3 text model
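A query against such a table might be sketched as below, assuming the table was created with an embedding function attached so that LanceDB embeds the query text itself. The helper name `top_captions` and the query string in the commented usage are hypothetical; the original queries are not shown.

```python
def top_captions(table, query, k=3):
    """Return the captions of the k nearest rows for a text query."""
    # search() embeds the query string using the table's registered embedding function.
    rows = table.search(query).limit(k).to_list()
    return [row["caption"] for row in rows]


# Possible usage (hypothetical query text):
# for caption in top_captions(table, "a cat relaxing indoors"):
#     print("Caption: ", caption)
```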

[ ]

Caption:  cat on a vintage chair in a sunny room

Caption:  young cute cat resting on leather sofa .

Caption:  cat lying down on the floor with paws up

[ ]

Caption:  mountains soar above villages along the shores

Caption:  snow on the distant mountains across the bay

Caption:  unique natural landscape the shore .

Voyage's Multimodal model

[ ]

Create a LanceDB table with the multimodal model to index the data so it can be queried with either text or an image
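The multimodal table can be sketched along the same lines. This is an assumption-heavy outline rather than the notebook's own code: it assumes the `voyageai` registry entry accepts the model name `voyage-multimodal-3` and that an image URI column can serve as the embedding source. The table name, path, and helper names are hypothetical.

```python
def to_image_records(rows):
    """Map dataset rows to the column names used by the schema below."""
    return [{"caption": r["caption"], "image_uri": r["image_url"]} for r in rows]


def create_multimodal_table(uri="/tmp/lancedb", name="multimodal"):
    """Create an empty table whose 'image_uri' column is embedded by the multimodal model."""
    # Deferred imports: lancedb and a valid VOYAGE_API_KEY are needed only at call time.
    import lancedb
    from lancedb.embeddings import get_registry
    from lancedb.pydantic import LanceModel, Vector

    # Assumed model name for Voyage's multimodal embeddings.
    voyage_mm = get_registry().get("voyageai").create(name="voyage-multimodal-3")

    class Image(LanceModel):
        caption: str
        image_uri: str = voyage_mm.SourceField()  # image fetched and embedded on insert
        vector: Vector(voyage_mm.ndims()) = voyage_mm.VectorField()

    db = lancedb.connect(uri)
    return db.create_table(name, schema=Image, mode="overwrite")
```

Because text and images share one embedding space in a multimodal model, a text query can be issued with `table.search(...)` in the same way as for the text-only table.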