InstructOR - a multitask custom embedding model for task-based applications, made easier with LanceDB

Installing all dependencies
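The install cell itself isn't preserved in this export; a typical setup for this walkthrough (package names are assumptions based on the libraries used below) would be:

```shell
pip install lancedb InstructorEmbedding sentence-transformers torch
```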
If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:
"Represent the [domain] [text_type] for [task_objective]:"
Here are some examples:
- "Represent the Science sentence:"
- "Represent the Financial statement:"
- "Represent the Wikipedia document for retrieval:"
- "Represent the Wikipedia question for retrieving supporting documents:"
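As a quick illustration, the template can be filled in with plain string formatting. The helper below is hypothetical (not part of InstructOR or LanceDB), just to show how the three slots combine:

```python
def build_instruction(text_type: str, task_objective: str = "", domain: str = "") -> str:
    """Fill the 'Represent the [domain] [text_type] for [task_objective]:' template.

    `domain` and `task_objective` are optional, matching the examples above.
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"

print(build_instruction("sentence", domain="Science"))
# Represent the Science sentence:
print(build_instruction("document", task_objective="retrieval", domain="Wikipedia"))
# Represent the Wikipedia document for retrieval:
```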
Importing necessary libraries
Calling the embedding model from LanceDB's embedding API
Adding Data to the Table
First use case - Semantic Search with LanceDB
                                              vector  \
0  [-0.024510663, 0.0005563084, 0.028840268, 0.08...

                                                text  _distance
0  Amoxicillin is an antibiotic medication common...   0.163671
Same Input Data with Different Instruction Pair
                                              vector  \
0  [-0.02448329, 0.00093284156, 0.033273738, 0.07...

                                                text  _distance
0  Amoxicillin is an antibiotic medication common...    0.18135
We can see that the _distance values for the two instruction pairs are different, which clearly indicates that the instruction affects the resulting embedding.
Second use case - Question Answering with LanceDB
Calling the embedding model with a different instruction pair
                                              vector  \
0  [0.02184453, 0.0017777232, 0.022723947, 0.0497...

                                                text  _distance
0  A cinema, also known as a movie theater or mov...   0.131036
Thanks for reading! For more examples like this, please visit LanceDB.