ColPali
Late interaction & efficient multi-modal retrievers need more than just a vector index
Blog -
Codebase - https://github.com/AyushExel/vision-retrieval (Based on https://github.com/kyryl-opens-ml/vision-retrieval)
ColPali is a visual retriever model that combines:
- PaliGemma - a VLM that combines the SigLIP-So400m/14 vision encoder with the Gemma-2B language model. It also introduces a projection layer to map the language model's outputs to 128-dim vectors
- A late interaction mechanism based on ColBERT
Like ColBERT, ColPali works in two phases:
- Offline:
  - Each document page is split into patches and processed through the vision encoder. Each patch is then passed through the projection layer to get its vector representation.
  - These vectors are stored as a multi-vector representation of the document, ready for retrieval at query time.
  - Each page is divided into 1030 patches, with each patch yielding a 128-dim vector.
- Online:
  - At query time, the user input is encoded by the language model.
  - Using the late interaction mechanism, MaxSim scores are computed between the query tokens and the already-embedded document patches.
  - For each query token, the maximum patch similarity is taken; these maxima are summed per page, and the top-K pages with the highest scores are returned as the final result.
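The MaxSim scoring at the heart of both ColBERT and ColPali can be sketched in a few lines of NumPy. This is a toy sketch with random stand-in embeddings, not actual model outputs:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one page.

    query_emb: (n_query_tokens, 128), page_emb: (n_patches, 128).
    For each query token, take the max similarity over all page
    patches, then sum those maxima across the query tokens.
    """
    sims = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())   # MaxSim per token, summed

# Toy example: a page that contains the query tokens scores higher
# than a page of unrelated patches.
rng = np.random.default_rng(0)
q = rng.normal(size=(25, 128))
page_a = np.vstack([q, rng.normal(size=(1005, 128))])  # contains the query tokens
page_b = rng.normal(size=(1030, 128))                  # unrelated patches
assert maxsim_score(q, page_a) > maxsim_score(q, page_b)
```

Note that, unlike single-vector similarity, this score is computed per token, which is what lets ColPali match a query word against the one page patch that contains it.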
Important: set HF auth token
To use the PaliGemma-3B model, you need to set HF_TOKEN in Colab.
Load Models
Load only the ColPali model if memory is a constraint.
Let's try it out!
In this example, we'll make retrieval challenging by ingesting documents from very different genres:
- Investor relations / financial reports for Q2 2024 from:
- Apple
- Amazon
- Meta
- Alphabet
- Netflix
- Starbucks
- Naruto Volume 72
- Arabian Nights
- InfraRed Cloud report
- Short Stories for Children, a children's short-story collection (from "Cuentos para la clase de Inglés", i.e. stories for the English class)
Download dataset
[Optional] remove some docs to speed up
Ingest data into LanceDB
Here we're streaming batch iterators into LanceDB, which persists the data on disk, so you're unlikely to run OOM on a decent machine even though the vector representation of each page is very high-dimensional (1030x128)!
NOTE: Consider removing some PDF files from the ./fin_pdf_data folder to speed up ingestion if it takes too long.
NOTE 2: For one of the experiments/comparisons below, we also read the PDFs and store their text in a column to create an FTS/BM25 index. In practice, this step is optional and can be skipped to speed up ingestion.
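The batch-iterator ingestion can be sketched as follows. This is a minimal sketch: `embed_pages` is a hypothetical stand-in for the real ColPali embedding step, and `fake_embed` just generates random arrays so the plumbing can be exercised without a GPU:

```python
import numpy as np

def page_batches(pdf_paths, embed_pages, batch_size=4):
    """Yield batches of page records ready for LanceDB ingestion.

    `embed_pages(path)` is a hypothetical helper that yields, for each
    page of the PDF, a (1030, 128) patch-embedding array. Streaming
    batches keeps peak memory low: only one batch of multi-vectors is
    materialized at a time.
    """
    batch = []
    for path in pdf_paths:
        for page_idx, emb in enumerate(embed_pages(path)):
            batch.append({
                "pdf": path,
                "page": page_idx,
                # Flattened so it can be stored as a fixed-size vector column.
                "vectors": emb.astype(np.float32).ravel().tolist(),
            })
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Stub embedder: 3 pages of random (1030, 128) embeddings per "PDF".
def fake_embed(path, n_pages=3):
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    for _ in range(n_pages):
        yield rng.normal(size=(1030, 128))

batches = list(page_batches(["a.pdf", "b.pdf"], fake_embed, batch_size=4))
assert sum(len(b) for b in batches) == 6
assert len(batches[0][0]["vectors"]) == 1030 * 128
```

The batches can then be written to a table with something like `lancedb.connect(...)` followed by `db.create_table(...)`; the exact data types accepted (dicts, PyArrow batches, Pandas) depend on your LanceDB version.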
We've now ingested more than 550 pages!
Retrieval
In this section, we'll see:
- The performance of ColPali in retrieving the correct document. In this case, it runs the MaxSim operation across ALL ingested document pages.
- Optimizing lookup time by reducing the search space with LanceDB FTS (ColPali as an FTS reranker)
- Optimizing lookup time by reducing the search space with LanceDB semantic search (ColPali as a vector-search reranker)
ColPali Retrieval
Time taken: 37.22 seconds
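The time above comes from brute-force scoring of every ingested page. That step can be sketched with random stand-in embeddings (shrunk to 50 pages so it runs quickly; the shapes match the real 1030x128 page embeddings):

```python
import time
import numpy as np

# Stand-in embeddings: 50 pages of (1030, 128) patches and one 25-token
# query. The real corpus here is ~550 pages, which is why brute-force
# scoring takes tens of seconds on CPU.
rng = np.random.default_rng(0)
pages = rng.normal(size=(50, 1030, 128)).astype(np.float32)
query = rng.normal(size=(25, 128)).astype(np.float32)

start = time.perf_counter()
# All query-token x patch similarities for every page at once.
sims = np.einsum("qd,pnd->pqn", query, pages)   # (50, 25, 1030)
scores = sims.max(axis=2).sum(axis=1)           # summed MaxSim per page
top3 = np.argsort(scores)[::-1][:3]
elapsed = time.perf_counter() - start
print(f"scored {len(pages)} pages in {elapsed:.3f}s, top pages: {top3}")
```

The cost is linear in the number of ingested pages, which motivates the search-space-reduction tricks below.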
ColPali MaxSims as a Reranking step
There are a couple of hacks that can be used to reduce the search space by filtering out some results. We'll reduce the search space to a fifth (top 100 pages).
1. If text field is available (ColPali as FTS reranker)
If you're able to extract text from the PDF (not via OCR; some PDF encodings allow reading text directly from the file), you can create an FTS index on the text column to reduce the search space. This effectively makes the ColPali MaxSim operation a reranking step on top of FTS. In this example, we'll take the top 100 FTS matches, which brings the search space down to about 1/5th.
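The two-stage pipeline can be sketched like this. `fts_search` here is a hypothetical stand-in for the LanceDB full-text lookup (in practice, a table with an FTS index on the text column); only the candidates it returns get the expensive MaxSim scoring:

```python
import numpy as np

def fts_then_rerank(query_text, query_emb, fts_search, embeddings,
                    k=3, candidates=100):
    """Two-stage retrieval: cheap FTS filter, then ColPali MaxSim rerank.

    fts_search(query_text, n): stand-in for a full-text search, returning
    up to n candidate page ids. embeddings: page id -> (1030, 128) array.
    Only the candidates are scored with MaxSim, shrinking the expensive
    step to a fraction of the corpus.
    """
    cand = fts_search(query_text, candidates)
    scored = sorted(
        cand,
        key=lambda pid: float((query_emb @ embeddings[pid].T).max(axis=1).sum()),
        reverse=True,
    )
    return scored[:k]

# Toy data: random pages plus one page that actually contains the query tokens.
rng = np.random.default_rng(3)
q_emb = rng.normal(size=(25, 128))
embs = {i: rng.normal(size=(1030, 128)) for i in range(10)}
embs["hit"] = np.vstack([q_emb, rng.normal(size=(1005, 128))])

def fake_fts(query_text, n):
    # Stand-in for the FTS index: returns a candidate subset of page ids.
    return [0, 1, 2, "hit", 5][:n]

assert fts_then_rerank("what was the revenue", q_emb, fake_fts, embs, k=1) == ["hit"]
```

The FTS stage only needs to have decent recall; precision is recovered by the MaxSim rerank.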
Time taken: 6.83 seconds
That reduced the time taken to about a fifth, as expected!
2. Reducing the search space using similarity search (ColPali as vector search reranker)
A vision-retriever pipeline ideally shouldn't depend on being able to parse text from the documents, as that defeats the purpose of a one-shot retrieval method.
Hypothesis
Remember, the 128-dim vector projections are derived from the language-model part of ColPali. The query embedding shape (here 25x128) isn't the same as the doc patch embedding shape (1030x128), but because both are representations from the same model, they may still capture similarity between the query and the doc patches. So we can flatten them, zero-pad the query embeddings to match the doc patch embeddings, and run vector search to pick the top_k candidates, reducing the search space for the ColPali MaxSim reranking.
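The zero-padding trick can be sketched as follows (a toy sketch with random arrays; the key property is that the padded zeros don't change the dot product used by the vector search):

```python
import numpy as np

def flatten_and_pad(query_emb, n_patches=1030, dim=128):
    """Zero-pad the (n_tokens, 128) query embedding to (1030, 128) and
    flatten it, so it matches the shape of a flattened page embedding
    and can be used in an ordinary single-vector (ANN) search."""
    padded = np.zeros((n_patches, dim), dtype=np.float32)
    padded[: query_emb.shape[0]] = query_emb
    return padded.ravel()

rng = np.random.default_rng(4)
q = rng.normal(size=(25, 128)).astype(np.float32)
page = rng.normal(size=(1030, 128)).astype(np.float32)

flat_q = flatten_and_pad(q)
assert flat_q.shape == (1030 * 128,)
# Padded zeros contribute nothing to the dot product: the score against a
# flattened page equals the sum of token-wise dot products over the real tokens.
assert np.isclose(flat_q @ page.ravel(),
                  float((q * page[:25]).sum()), rtol=1e-3, atol=1e-3)
```

This makes the multi-vector data look like ordinary single vectors to the index, at the cost of only comparing query token i against patch i rather than all patches, which is why it's a coarse filter and not a replacement for MaxSim.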
Time taken: 10.63 seconds
Let's try a couple more queries from different books
Time taken: 9.66 seconds
Time taken: 9.04 seconds