Arxiv Search with OpenCLIP
In this example, we’ll create an Arxiv search or recommender system using multimodal semantic search powered by LanceDB. We’ll also compare its results with keyword-based search on Nomic's Atlas.
OpenCLIP
OpenCLIP is an open-source version of CLIP, a model that links images and text. It learns to understand both by training on pairs of pictures and descriptions. You can use it to:
- Match images with text (like finding pictures based on a description).
- Classify images without extra training (zero-shot learning).
- Build creative tools (e.g., text-to-image models).
It’s flexible, free to use, and works well for tasks combining images and language.
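Conceptually, CLIP-style matching reduces to embedding both modalities into one vector space and comparing them with cosine similarity. Here is a minimal sketch of that idea using toy hand-written vectors (real CLIP embeddings are 512- or 768-dimensional and come from the model, not from a list literal):

```python
import math

def cosine_similarity(a, b):
    # CLIP-style matching: both modalities live in the same vector space,
    # and similarity is the cosine of the angle between their embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, purely illustrative.
image_embedding = [0.9, 0.1, 0.2]
captions = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a stock market chart": [0.1, 0.9, 0.7],
}

# Zero-shot matching: pick the caption whose embedding is closest to the image.
best = max(captions, key=lambda c: cosine_similarity(image_embedding, captions[c]))
```

This is exactly the mechanism behind "finding pictures based on a description": no task-specific training, just nearest-neighbour lookup in the shared embedding space.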
Now let's jump into building the Arxiv paper recommendation system.
Embedding Paper Summary with OpenCLIP
Create a DataFrame of the desired length
Here we'll use the `arxiv` Python package to interact with the Arxiv API and fetch the document data.
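The fetched results end up in a DataFrame of titles, summaries, and links. Since the exact notebook schema isn't shown here, this sketch stubs two well-known papers in place of live API results; the column names are assumptions:

```python
import pandas as pd

# In the notebook these records come from iterating over the arxiv client's
# results; here we stub two entries so the shape of the DataFrame is clear.
papers = [
    {"title": "Attention Is All You Need",
     "summary": "We propose the Transformer, a model based on attention.",
     "url": "https://arxiv.org/abs/1706.03762"},
    {"title": "Deep Residual Learning for Image Recognition",
     "summary": "We present a residual learning framework.",
     "url": "https://arxiv.org/abs/1512.03385"},
]

df = pd.DataFrame(papers)
```

Each row's `summary` field is what gets embedded with OpenCLIP in the next step.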
Semantic Searching by Concepts or Summary
Start Querying
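Under the hood, a semantic query embeds the query text and ranks the stored summary embeddings by cosine similarity, which is what a vector database's search call does for you. Here is a self-contained sketch of that ranking step; the bag-of-characters `embed` function is a stand-in for the real OpenCLIP text encoder:

```python
import math

def embed(text):
    # Stand-in for the OpenCLIP text encoder: a bag-of-characters vector.
    # In the notebook, the real model produces the embeddings.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

summaries = [
    "diffusion models for image generation",
    "graph neural networks for molecules",
    "transformers for language modeling",
]
index = [(s, embed(s)) for s in summaries]

def search(query, k=1):
    # Rank stored summaries by cosine similarity to the query embedding
    # and return the top-k matches.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:k]]
```

Swapping the stand-in encoder for the real model (and the list for a LanceDB table) gives the search used in this example.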
Full Text Search
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
LanceDB now provides experimental support for full text search. This is currently Python only. We plan to push the integration down to Rust in the future to make this available for JS as well.
Build FTS index for the summary
Here, we're building the FTS index using the Python bindings for tantivy. You can also build an index on any other text column. A full-text index stores information about significant words and their locations within one or more columns of a database table.
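The core data structure behind such an index is an inverted index: each significant word maps to the rows (and word positions) where it occurs. This toy version shows the idea; a real engine like tantivy adds tokenization rules, stop-word handling, and BM25 scoring on top:

```python
from collections import defaultdict

def build_fts_index(rows):
    # Inverted index: word -> list of (row_id, position) postings.
    index = defaultdict(list)
    for row_id, text in enumerate(rows):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((row_id, pos))
    return index

summaries = [
    "vector search with learned embeddings",
    "full text search with an inverted index",
]
fts = build_fts_index(summaries)

def fts_search(index, word):
    # Look up the posting list for a term: every row and position it appears in.
    return index.get(word.lower(), [])
```

This is why full-text search matches exact terms quickly, while the semantic search above matches by meaning even when no words overlap.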
Analysing OpenCLIP embeddings on Nomic
Atlas is a platform for interacting with both small and internet-scale unstructured datasets.
Atlas enables you to:
- Store, update and organize multi-million point datasets of unstructured text, images and embeddings.
- Visually interact with embeddings of your data from a web browser.
- Operate over unstructured data and embeddings with topic modeling, semantic duplicate clustering and semantic search.
- Generate high dimensional and two-dimensional embeddings of your data.
Nomic Login
We are using Nomic's Atlas to visualize the dataset as clusters.
Authenticate with the Nomic API at https://atlas.nomic.ai/cli-login. Click the link to retrieve your access token, then run `nomic login [token]`.
The visualizations are very interesting and worth exploring further. Even in a preliminary analysis, you can see that Atlas successfully creates clusters of similar types of papers. A natural next step is comparing embeddings across different OpenCLIP model sizes and training datasets.