Multilingual E5
Multilingual vector search with E5 embedding models
In this example we'll use the multilingual embedding model multilingual-e5-base to search a toy dataset of mixed-language documents. This notebook follows the blog post of the same title: Multilingual vector search with E5 embedding models.
🧰 Requirements
For this example, you will need:
- An Elastic Cloud deployment with an ML node (min. 8 GB memory)
- We'll be using Elastic Cloud for this example (available with a free trial)
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial.
- Go to the Create deployment page
- Select Create deployment
- Use the default node types for Elasticsearch and Kibana
- Add an ML node with 8 GB memory (the multilingual E5 base model is larger than most)
Setup Elasticsearch environment
To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.
First we need to pip install the packages we need for this example.
Next we need to import the elasticsearch module and the getpass module.
getpass is part of the Python standard library and is used to securely prompt for credentials.
Now we can instantiate the Python Elasticsearch client. First we prompt the user for their password and Cloud ID.
🔐 NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal or saving them in the notebook.
Then we create a client object that instantiates an instance of the Elasticsearch class.
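The connection setup can be sketched as below. The `client_kwargs` helper is hypothetical, introduced only to keep the sketch self-contained; the Cloud ID and password are placeholders you would normally collect via `getpass`:

```python
# Hypothetical helper: assembles the keyword arguments passed to Elasticsearch(...)
def client_kwargs(cloud_id: str, password: str) -> dict:
    return {
        "cloud_id": cloud_id,                 # identifies the Elastic Cloud deployment
        "basic_auth": ("elastic", password),  # default superuser + prompted password
    }

# Placeholder values; in the notebook these come from getpass prompts.
kwargs = client_kwargs("my-deployment:BASE64DATA", "changeme")

# In the notebook this becomes:
#   from getpass import getpass
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(**client_kwargs(getpass("Cloud ID: "), getpass("Password: ")))
```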
Setup embedding model
Next we upload the E5 multilingual embedding model into Elasticsearch and create an ingest pipeline to automatically create embeddings when ingesting documents. For more details on this process, please see the blog post: How to deploy NLP: Text Embeddings and Vector Search
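The ingest pipeline can be sketched as a plain request body. The model id below is an assumption (when imported with eland, the Hugging Face model intfloat/multilingual-e5-base typically gets an id like this; check your deployment's trained models for the exact value), and the pipeline id "multilingual-e5" is illustrative:

```python
# Ingest pipeline body: an inference processor that embeds the "passage" field
# of each incoming document into "passage_embedding".
pipeline_body = {
    "description": "Embed passages with multilingual E5",
    "processors": [
        {
            "inference": {
                # Assumed model id -- verify against your deployment.
                "model_id": "multilingual-e5-base",
                "target_field": "passage_embedding",
                # The deployed model reads its input from "text_field".
                "field_map": {"passage": "text_field"},
            }
        }
    ],
}

# In the notebook:
#   es.ingest.put_pipeline(
#       id="multilingual-e5",
#       description=pipeline_body["description"],
#       processors=pipeline_body["processors"],
#   )
```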
Index documents
We need to add a field to support dense vector storage and search.
Note the passage_embedding.predicted_value field below, which is used to store the dense vector representation of the passage field, and will be automatically populated by the inference processor in the pipeline created above. The passage_embedding field will also store metadata from the inference process.
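A sketch of the mapping, assuming dims=768 (the output size of multilingual-e5-base — verify against the deployed model's config) and an illustrative index name "articles":

```python
# Index mapping: a dense_vector field for the embedding, plus the source fields.
mappings = {
    "properties": {
        "passage": {"type": "text"},
        "language": {"type": "keyword"},
        "passage_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 768,            # assumed embedding size for e5-base
            "index": True,          # enable kNN search on this field
            "similarity": "cosine",
        },
    }
}

# In the notebook:
#   es.indices.create(index="articles", mappings=mappings)
```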
Now that we have the pipeline and mappings ready, we can index our documents. This is of course just a demo so we only index the few toy examples from the blog post.
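Indexing can be sketched with the bulk helper. The four toy documents below are the ones from the blog post, each prefixed with "passage: " per the E5 input convention; the index name and pipeline id are illustrative assumptions carried over from the earlier steps:

```python
# The four toy documents; E5 expects passages prefixed with "passage: ".
docs = [
    {"_id": "doc1", "language": "en", "passage": "passage: I sat on the bank of the river today."},
    {"_id": "doc2", "language": "de", "passage": "passage: Ich bin heute zum Flussufer gegangen."},
    {"_id": "doc3", "language": "en", "passage": "passage: I walked to the bank today to deposit money."},
    {"_id": "doc4", "language": "de", "passage": "passage: Ich saß heute bei der Bank und wartete auf mein Geld."},
]

# Bulk actions routed through the ingest pipeline so embeddings are
# generated automatically at index time.
actions = [
    {"_index": "articles", "pipeline": "multilingual-e5", **doc} for doc in docs
]

# In the notebook:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```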
Multilingual semantic search
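A search request can be sketched as a kNN clause that embeds the query text at search time with the same model. Per the E5 convention, queries are prefixed with "query: ". The query text, index name, model id, and k/num_candidates values here are illustrative assumptions:

```python
# Build a kNN clause that vectorizes the query text via the deployed model.
def knn_query(text: str) -> dict:
    return {
        "field": "passage_embedding.predicted_value",
        "k": 2,                  # return the top 2 nearest passages
        "num_candidates": 10,    # candidates considered per shard
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "multilingual-e5-base",  # assumed model id
                "model_text": f"query: {text}",      # E5 query prefix
            }
        },
    }

query = knn_query("riverside")

# In the notebook:
#   es.search(index="articles", knn=knn_query("riverside"),
#             fields=["passage", "language"])
```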
ID: doc1  Language: en  Passage: passage: I sat on the bank of the river today.  Score: 0.88001645
ID: doc2  Language: de  Passage: passage: Ich bin heute zum Flussufer gegangen.  Score: 0.87662137

ID: doc4  Language: de  Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.  Score: 0.8967148
ID: doc3  Language: en  Passage: passage: I walked to the bank today to deposit money.  Score: 0.8863925

ID: doc3  Language: en  Passage: passage: I walked to the bank today to deposit money.  Score: 0.87475425
ID: doc2  Language: de  Passage: passage: Ich bin heute zum Flussufer gegangen.  Score: 0.8741033

ID: doc4  Language: de  Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.  Score: 0.85991657
ID: doc1  Language: en  Passage: passage: I sat on the bank of the river today.  Score: 0.8561436