InstructOR - a multitask custom embedding model for task-based applications, made easier with LanceDB

Installing all dependencies
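The install cell itself isn't preserved in this export; a typical setup for this walkthrough (package names are assumptions based on the libraries used below) would be:

```shell
pip install lancedb InstructorEmbedding sentence-transformers torch
```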
If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:
"Represent the [domain] [text_type] for [task_objective]:"
Here are some examples:
- "Represent the Science sentence:"
- "Represent the Financial statement:"
- "Represent the Wikipedia document for retrieval:"
- "Represent the Wikipedia question for retrieving supporting documents:"
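As a quick illustration, the template can be filled in with plain string formatting. The helper below is hypothetical (not part of InstructOR or LanceDB), just to show how the three slots combine:

```python
def build_instruction(text_type: str, task_objective: str = "", domain: str = "") -> str:
    """Fill the 'Represent the [domain] [text_type] for [task_objective]:' template.

    `domain` and `task_objective` are optional, matching the examples above.
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"

print(build_instruction("sentence", domain="Science"))
# Represent the Science sentence:
print(build_instruction("document", task_objective="retrieval", domain="Wikipedia"))
# Represent the Wikipedia document for retrieval:
```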
Importing necessary libraries
Calling the embedding model from LanceDB's embedding API
Adding Data to the Table
First use case - Semantic Search with LanceDB
                                              vector  \
0  [-0.024510663, 0.0005563084, 0.028840268, 0.08...

                                                text  _distance
0  Amoxicillin is an antibiotic medication common...   0.163671
Same Input Data with Different Instruction Pair
                                              vector  \
0  [-0.02448329, 0.00093284156, 0.033273738, 0.07...

                                                text  _distance
0  Amoxicillin is an antibiotic medication common...    0.18135
We can see that the _distance values for the two instruction pairs are different, which clearly indicates that the instruction affects the resulting embedding.
Second use case - Question Answering with LanceDB
Calling the embedding model with a different instruction pair
                                              vector  \
0  [0.02184453, 0.0017777232, 0.022723947, 0.0497...

                                                text  _distance
0  A cinema, also known as a movie theater or mov...   0.131036
Thanks for reading! For more examples like this, please visit LanceDB.