Multilingual E5
Multilingual vector search with E5 embedding models
In this example we'll use the multilingual embedding model multilingual-e5-base to search a toy dataset of mixed-language documents. This notebook follows the blog post of the same title: Multilingual vector search with E5 embedding models.
🧰 Requirements
For this example, you will need:
- An Elastic Cloud deployment with an ML node (min. 8 GB memory)
- We'll be using Elastic Cloud for this example (available with a free trial)
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial.
- Go to the Create deployment page
- Select Create deployment
- Use the default node types for Elasticsearch and Kibana
- Add an ML node with 8 GB memory (the multilingual E5 base model is larger than most)
Setup Elasticsearch environment
To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.
First we need to pip install the packages we need for this example.
Next we need to import the elasticsearch module and the getpass module.
getpass is part of the Python standard library and is used to securely prompt for credentials.
Now we can instantiate the Python Elasticsearch client. First we prompt the user for their password and Cloud ID.
🔐 NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal or saving them in the notebook.
Then we create a client object that instantiates an instance of the Elasticsearch class.
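The connection setup can be sketched as below. The `client_kwargs` helper is hypothetical, introduced only to keep the sketch self-contained; the Cloud ID and password are placeholders you would normally collect via `getpass`:

```python
# Hypothetical helper: assembles the keyword arguments passed to Elasticsearch(...)
def client_kwargs(cloud_id: str, password: str) -> dict:
    return {
        "cloud_id": cloud_id,                 # identifies the Elastic Cloud deployment
        "basic_auth": ("elastic", password),  # default superuser + prompted password
    }

# Placeholder values; in the notebook these come from getpass prompts.
kwargs = client_kwargs("my-deployment:BASE64DATA", "changeme")

# In the notebook this becomes:
#   from getpass import getpass
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(**client_kwargs(getpass("Cloud ID: "), getpass("Password: ")))
```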
Setup embedding model
Next we upload the E5 multilingual embedding model into Elasticsearch and create an ingest pipeline to automatically create embeddings when ingesting documents. For more details on this process, please see the blog post: How to deploy NLP: Text Embeddings and Vector Search
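The ingest pipeline can be sketched as a plain request body. The model id below is an assumption (when imported with eland, the Hugging Face model intfloat/multilingual-e5-base typically gets an id like this; check your deployment's trained models for the exact value), and the pipeline id "multilingual-e5" is illustrative:

```python
# Ingest pipeline body: an inference processor that embeds the "passage" field
# of each incoming document into "passage_embedding".
pipeline_body = {
    "description": "Embed passages with multilingual E5",
    "processors": [
        {
            "inference": {
                # Assumed model id -- verify against your deployment.
                "model_id": "multilingual-e5-base",
                "target_field": "passage_embedding",
                # The deployed model reads its input from "text_field".
                "field_map": {"passage": "text_field"},
            }
        }
    ],
}

# In the notebook:
#   es.ingest.put_pipeline(
#       id="multilingual-e5",
#       description=pipeline_body["description"],
#       processors=pipeline_body["processors"],
#   )
```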
Index documents
We need to add a field to support dense vector storage and search.
Note the passage_embedding.predicted_value field below, which is used to store the dense vector representation of the passage field, and will be automatically populated by the inference processor in the pipeline created above. The passage_embedding field will also store metadata from the inference process.
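A sketch of the mapping, assuming dims=768 (the output size of multilingual-e5-base — verify against the deployed model's config) and an illustrative index name "articles":

```python
# Index mapping: a dense_vector field for the embedding, plus the source fields.
mappings = {
    "properties": {
        "passage": {"type": "text"},
        "language": {"type": "keyword"},
        "passage_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 768,            # assumed embedding size for e5-base
            "index": True,          # enable kNN search on this field
            "similarity": "cosine",
        },
    }
}

# In the notebook:
#   es.indices.create(index="articles", mappings=mappings)
```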
Now that we have the pipeline and mappings ready, we can index our documents. This is of course just a demo so we only index the few toy examples from the blog post.
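Indexing can be sketched with the bulk helper. The four toy documents below are the ones from the blog post, each prefixed with "passage: " per the E5 input convention; the index name and pipeline id are illustrative assumptions carried over from the earlier steps:

```python
# The four toy documents; E5 expects passages prefixed with "passage: ".
docs = [
    {"_id": "doc1", "language": "en", "passage": "passage: I sat on the bank of the river today."},
    {"_id": "doc2", "language": "de", "passage": "passage: Ich bin heute zum Flussufer gegangen."},
    {"_id": "doc3", "language": "en", "passage": "passage: I walked to the bank today to deposit money."},
    {"_id": "doc4", "language": "de", "passage": "passage: Ich saß heute bei der Bank und wartete auf mein Geld."},
]

# Bulk actions routed through the ingest pipeline so embeddings are
# generated automatically at index time.
actions = [
    {"_index": "articles", "pipeline": "multilingual-e5", **doc} for doc in docs
]

# In the notebook:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```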
Multilingual semantic search
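A search request can be sketched as a kNN clause that embeds the query text at search time with the same model. Per the E5 convention, queries are prefixed with "query: ". The query text, index name, model id, and k/num_candidates values here are illustrative assumptions:

```python
# Build a kNN clause that vectorizes the query text via the deployed model.
def knn_query(text: str) -> dict:
    return {
        "field": "passage_embedding.predicted_value",
        "k": 2,                  # return the top 2 nearest passages
        "num_candidates": 10,    # candidates considered per shard
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "multilingual-e5-base",  # assumed model id
                "model_text": f"query: {text}",      # E5 query prefix
            }
        },
    }

query = knn_query("riverside")

# In the notebook:
#   es.search(index="articles", knn=knn_query("riverside"),
#             fields=["passage", "language"])
```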
ID: doc1  Language: en  Passage: passage: I sat on the bank of the river today.  Score: 0.88001645
ID: doc2  Language: de  Passage: passage: Ich bin heute zum Flussufer gegangen.  Score: 0.87662137

ID: doc4  Language: de  Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.  Score: 0.8967148
ID: doc3  Language: en  Passage: passage: I walked to the bank today to deposit money.  Score: 0.8863925

ID: doc3  Language: en  Passage: passage: I walked to the bank today to deposit money.  Score: 0.87475425
ID: doc2  Language: de  Passage: passage: Ich bin heute zum Flussufer gegangen.  Score: 0.8741033

ID: doc4  Language: de  Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.  Score: 0.85991657
ID: doc1  Language: en  Passage: passage: I sat on the bank of the river today.  Score: 0.8561436