ColPali
Late interaction & efficient multi-modal retrievers need more than just a vector index
Blog -
Codebase - https://github.com/AyushExel/vision-retrieval (Based on https://github.com/kyryl-opens-ml/vision-retrieval)
ColPali is a visual retriever model that combines:
- PaliGemma - a VLM that combines the SigLIP-So400m/14 vision encoder with the Gemma-2B language model. It also introduces a projection layer to map the language model's outputs to 128-dim vectors
- A late interaction mechanism based on ColBERT
Like ColBERT, ColPali works in two phases:
- Offline:
  - Each document page is split into patches and processed through the vision encoder. Each patch is then passed through the projection layer to get its vector representation.
  - These vectors are stored as a multi-vector representation of the document, ready for retrieval at query time.
  - Each page is divided into 1030 patches, with each patch yielding a 128-dim vector.
- Online:
  - At query time, the user input is encoded by the language model.
  - Using the late interaction mechanism, MaxSim scores are computed between the query tokens and the already-embedded document patches.
  - For each query token, the maximum patch similarity is taken; these maxima are summed per page, and the top-K pages with the highest scores are returned as the final result.
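The MaxSim scoring at the heart of both ColBERT and ColPali can be sketched in a few lines of NumPy. This is a toy sketch with random stand-in embeddings, not actual model outputs:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one page.

    query_emb: (n_query_tokens, 128), page_emb: (n_patches, 128).
    For each query token, take the max similarity over all page
    patches, then sum those maxima across the query tokens.
    """
    sims = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())   # MaxSim per token, summed

# Toy example: a page that contains the query tokens scores higher
# than a page of unrelated patches.
rng = np.random.default_rng(0)
q = rng.normal(size=(25, 128))
page_a = np.vstack([q, rng.normal(size=(1005, 128))])  # contains the query tokens
page_b = rng.normal(size=(1030, 128))                  # unrelated patches
assert maxsim_score(q, page_a) > maxsim_score(q, page_b)
```

Note that, unlike single-vector similarity, this score is computed per token, which is what lets ColPali match a query word against the one page patch that contains it.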
Important: set HF auth token
To use the PaliGemma-3B model, you need to set HF_TOKEN in Colab.
Load Models
Load only the ColPali model if memory is a constraint.
Let's try it out!
In this example, we'll make retrieval challenging by ingesting documents from very different genres:
- Investor relations / financial reports for Q2 2024 from:
- Apple
- Amazon
- Meta
- Alphabet
- Netflix
- Starbucks
- Naruto Volume 72
- Arabian Nights
- InfraRed Cloud report
- Short Stories for Children, a children's short-story collection (from "Cuentos para la clase de Inglés", i.e. stories for the English class)
Download dataset
[Optional] remove some docs to speed up
Ingest data into LanceDB
Here we're streaming batch iterators into LanceDB, which persists the data on disk, so you're unlikely to run OOM on a decent machine even though the vector representation of each page is very high-dimensional (1030x128)!
NOTE: Consider removing some PDF files from the ./fin_pdf_data folder to speed up ingestion if it takes too long.
NOTE 2: For one of the experiments/comparisons below, we also read the PDFs and store their text in a column to create an FTS/BM25 index. In practice, this step is optional and can be skipped to speed up ingestion.
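The batch-iterator ingestion can be sketched as follows. This is a minimal sketch: `embed_pages` is a hypothetical stand-in for the real ColPali embedding step, and `fake_embed` just generates random arrays so the plumbing can be exercised without a GPU:

```python
import numpy as np

def page_batches(pdf_paths, embed_pages, batch_size=4):
    """Yield batches of page records ready for LanceDB ingestion.

    `embed_pages(path)` is a hypothetical helper that yields, for each
    page of the PDF, a (1030, 128) patch-embedding array. Streaming
    batches keeps peak memory low: only one batch of multi-vectors is
    materialized at a time.
    """
    batch = []
    for path in pdf_paths:
        for page_idx, emb in enumerate(embed_pages(path)):
            batch.append({
                "pdf": path,
                "page": page_idx,
                # Flattened so it can be stored as a fixed-size vector column.
                "vectors": emb.astype(np.float32).ravel().tolist(),
            })
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Stub embedder: 3 pages of random (1030, 128) embeddings per "PDF".
def fake_embed(path, n_pages=3):
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    for _ in range(n_pages):
        yield rng.normal(size=(1030, 128))

batches = list(page_batches(["a.pdf", "b.pdf"], fake_embed, batch_size=4))
assert sum(len(b) for b in batches) == 6
assert len(batches[0][0]["vectors"]) == 1030 * 128
```

The batches can then be written to a table with something like `lancedb.connect(...)` followed by `db.create_table(...)`; the exact data types accepted (dicts, PyArrow batches, Pandas) depend on your LanceDB version.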
We've now ingested more than 550 pages!
Retrieval
In this section, we'll see:
- The performance of ColPali in retrieving the correct document. In this case, it runs the MaxSim operation across ALL ingested document pages.
- Optimizing lookup time by reducing the search space with LanceDB FTS (ColPali as an FTS reranker)
- Optimizing lookup time by reducing the search space with LanceDB semantic search (ColPali as a vector-search reranker)
ColPali Retrieval
Time taken: 37.22 seconds
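The time above comes from brute-force scoring of every ingested page. That step can be sketched with random stand-in embeddings (shrunk to 50 pages so it runs quickly; the shapes match the real 1030x128 page embeddings):

```python
import time
import numpy as np

# Stand-in embeddings: 50 pages of (1030, 128) patches and one 25-token
# query. The real corpus here is ~550 pages, which is why brute-force
# scoring takes tens of seconds on CPU.
rng = np.random.default_rng(0)
pages = rng.normal(size=(50, 1030, 128)).astype(np.float32)
query = rng.normal(size=(25, 128)).astype(np.float32)

start = time.perf_counter()
# All query-token x patch similarities for every page at once.
sims = np.einsum("qd,pnd->pqn", query, pages)   # (50, 25, 1030)
scores = sims.max(axis=2).sum(axis=1)           # summed MaxSim per page
top3 = np.argsort(scores)[::-1][:3]
elapsed = time.perf_counter() - start
print(f"scored {len(pages)} pages in {elapsed:.3f}s, top pages: {top3}")
```

The cost is linear in the number of ingested pages, which motivates the search-space-reduction tricks below.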
ColPali MaxSims as a Reranking step
There are a couple of hacks that can be used to reduce the search space by filtering out some results. We'll reduce the search space to a fifth (top 100 pages).
1. If text field is available (ColPali as FTS reranker)
If you're able to extract text from the PDF (not via OCR; some PDF encodings allow reading text directly from the file), you can create an FTS index on the text column to reduce the search space. This effectively makes the ColPali MaxSim operation a reranking step on top of FTS. In this example, we'll take the top 100 FTS matches, which brings the search space down to about 1/5th.
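The two-stage pipeline can be sketched like this. `fts_search` here is a hypothetical stand-in for the LanceDB full-text lookup (in practice, a table with an FTS index on the text column); only the candidates it returns get the expensive MaxSim scoring:

```python
import numpy as np

def fts_then_rerank(query_text, query_emb, fts_search, embeddings,
                    k=3, candidates=100):
    """Two-stage retrieval: cheap FTS filter, then ColPali MaxSim rerank.

    fts_search(query_text, n): stand-in for a full-text search, returning
    up to n candidate page ids. embeddings: page id -> (1030, 128) array.
    Only the candidates are scored with MaxSim, shrinking the expensive
    step to a fraction of the corpus.
    """
    cand = fts_search(query_text, candidates)
    scored = sorted(
        cand,
        key=lambda pid: float((query_emb @ embeddings[pid].T).max(axis=1).sum()),
        reverse=True,
    )
    return scored[:k]

# Toy data: random pages plus one page that actually contains the query tokens.
rng = np.random.default_rng(3)
q_emb = rng.normal(size=(25, 128))
embs = {i: rng.normal(size=(1030, 128)) for i in range(10)}
embs["hit"] = np.vstack([q_emb, rng.normal(size=(1005, 128))])

def fake_fts(query_text, n):
    # Stand-in for the FTS index: returns a candidate subset of page ids.
    return [0, 1, 2, "hit", 5][:n]

assert fts_then_rerank("what was the revenue", q_emb, fake_fts, embs, k=1) == ["hit"]
```

The FTS stage only needs to have decent recall; precision is recovered by the MaxSim rerank.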
Time taken: 6.83 seconds
That reduced the time taken to about a fifth, as expected!
2. Reducing the search space using similarity search (ColPali as vector search reranker)
A vision-retriever pipeline ideally shouldn't depend on being able to parse text from the documents, as that defeats the purpose of a one-shot retrieval method.
Hypothesis
Remember, the 128-dim vector projections are derived from the language-model part of ColPali. The query embedding shape (here 25x128) isn't the same as the doc patch embedding shape (1030x128), but because both are representations from the same model, they may still capture similarity between the query and the doc patches. So we can flatten them, zero-pad the query embeddings to match the doc patch embeddings, and run vector search to pick the top_k candidates, reducing the search space for the ColPali MaxSim reranking.
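The zero-padding trick can be sketched as follows (a toy sketch with random arrays; the key property is that the padded zeros don't change the dot product used by the vector search):

```python
import numpy as np

def flatten_and_pad(query_emb, n_patches=1030, dim=128):
    """Zero-pad the (n_tokens, 128) query embedding to (1030, 128) and
    flatten it, so it matches the shape of a flattened page embedding
    and can be used in an ordinary single-vector (ANN) search."""
    padded = np.zeros((n_patches, dim), dtype=np.float32)
    padded[: query_emb.shape[0]] = query_emb
    return padded.ravel()

rng = np.random.default_rng(4)
q = rng.normal(size=(25, 128)).astype(np.float32)
page = rng.normal(size=(1030, 128)).astype(np.float32)

flat_q = flatten_and_pad(q)
assert flat_q.shape == (1030 * 128,)
# Padded zeros contribute nothing to the dot product: the score against a
# flattened page equals the sum of token-wise dot products over the real tokens.
assert np.isclose(flat_q @ page.ravel(),
                  float((q * page[:25]).sum()), rtol=1e-3, atol=1e-3)
```

This makes the multi-vector data look like ordinary single vectors to the index, at the cost of only comparing query token i against patch i rather than all patches, which is why it's a coarse filter and not a replacement for MaxSim.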
Time taken: 10.63 seconds
Let's try a couple more queries from different books
Time taken: 9.66 seconds
Time taken: 9.04 seconds