Arxiv Search with OpenCLIP
In this example, we’ll create an Arxiv search or recommender system using multimodal semantic search powered by LanceDB. We’ll also compare its results with keyword-based search on Nomic's Atlas.
OpenCLIP
OpenCLIP is an open-source version of CLIP, a model that links images and text. It learns to understand both by training on pairs of pictures and descriptions. You can use it to:
- Match images with text (like finding pictures based on a description).
- Classify images without extra training (zero-shot learning).
- Build creative tools (e.g., text-to-image models).
It’s flexible, free to use, and works well for tasks combining images and language.
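Conceptually, CLIP-style matching reduces to embedding both modalities into one vector space and comparing them with cosine similarity. Here is a minimal sketch of that idea using toy hand-written vectors (real CLIP embeddings are 512- or 768-dimensional and come from the model, not from a list literal):

```python
import math

def cosine_similarity(a, b):
    # CLIP-style matching: both modalities live in the same vector space,
    # and similarity is the cosine of the angle between their embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, purely illustrative.
image_embedding = [0.9, 0.1, 0.2]
captions = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a stock market chart": [0.1, 0.9, 0.7],
}

# Zero-shot matching: pick the caption whose embedding is closest to the image.
best = max(captions, key=lambda c: cosine_similarity(image_embedding, captions[c]))
```

This is exactly the mechanism behind "finding pictures based on a description": no task-specific training, just nearest-neighbour lookup in the shared embedding space.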
Now let's jump into building the Arxiv paper recommendation system.
Embedding Paper Summary with OpenCLIP
Create a DataFrame of the desired length
Here we'll use the `arxiv` Python package to interact with the Arxiv API and fetch the document data.
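The fetched results end up in a DataFrame of titles, summaries, and links. Since the exact notebook schema isn't shown here, this sketch stubs two well-known papers in place of live API results; the column names are assumptions:

```python
import pandas as pd

# In the notebook these records come from iterating over the arxiv client's
# results; here we stub two entries so the shape of the DataFrame is clear.
papers = [
    {"title": "Attention Is All You Need",
     "summary": "We propose the Transformer, a model based on attention.",
     "url": "https://arxiv.org/abs/1706.03762"},
    {"title": "Deep Residual Learning for Image Recognition",
     "summary": "We present a residual learning framework.",
     "url": "https://arxiv.org/abs/1512.03385"},
]

df = pd.DataFrame(papers)
```

Each row's `summary` field is what gets embedded with OpenCLIP in the next step.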
Semantic Searching by Concepts or Summary
Start Querying
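Under the hood, a semantic query embeds the query text and ranks the stored summary embeddings by cosine similarity, which is what a vector database's search call does for you. Here is a self-contained sketch of that ranking step; the bag-of-characters `embed` function is a stand-in for the real OpenCLIP text encoder:

```python
import math

def embed(text):
    # Stand-in for the OpenCLIP text encoder: a bag-of-characters vector.
    # In the notebook, the real model produces the embeddings.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

summaries = [
    "diffusion models for image generation",
    "graph neural networks for molecules",
    "transformers for language modeling",
]
index = [(s, embed(s)) for s in summaries]

def search(query, k=1):
    # Rank stored summaries by cosine similarity to the query embedding
    # and return the top-k matches.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:k]]
```

Swapping the stand-in encoder for the real model (and the list for a LanceDB table) gives the search used in this example.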
Full Text Search
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
LanceDB now provides experimental support for full text search. This is currently Python only. We plan to push the integration down to Rust in the future to make this available for JS as well.
Build FTS index for the summary
Here, we're building the FTS index using the Python bindings for tantivy. You can also build an index on any other text column. A full-text index stores information about significant words and their locations within one or more columns of a database table.
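The core data structure behind such an index is an inverted index: each significant word maps to the rows (and word positions) where it occurs. This toy version shows the idea; a real engine like tantivy adds tokenization rules, stop-word handling, and BM25 scoring on top:

```python
from collections import defaultdict

def build_fts_index(rows):
    # Inverted index: word -> list of (row_id, position) postings.
    index = defaultdict(list)
    for row_id, text in enumerate(rows):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((row_id, pos))
    return index

summaries = [
    "vector search with learned embeddings",
    "full text search with an inverted index",
]
fts = build_fts_index(summaries)

def fts_search(index, word):
    # Look up the posting list for a term: every row and position it appears in.
    return index.get(word.lower(), [])
```

This is why full-text search matches exact terms quickly, while the semantic search above matches by meaning even when no words overlap.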
Analysing OpenCLIP embeddings on Nomic
Atlas is a platform for interacting with both small and internet-scale unstructured datasets.
Atlas enables you to:
- Store, update and organize multi-million point datasets of unstructured text, images and embeddings.
- Visually interact with embeddings of your data from a web browser.
- Operate over unstructured data and embeddings with topic modeling, semantic duplicate clustering and semantic search.
- Generate high dimensional and two-dimensional embeddings of your data.
Nomic Login
We are using Nomic's Atlas to visualize the dataset as clusters.
Authenticate with the Nomic API at https://atlas.nomic.ai/cli-login. Click the link to retrieve your access token, then run `nomic login [token]`.
The visualizations are very interesting and worth exploring further. Even in a preliminary analysis, you can see that Atlas successfully creates clusters of similar types of papers. A natural next step is comparing embeddings across different OpenCLIP model sizes and training datasets.