Ecommerce Dense Sparse Project
Lexical and Semantic Search with Elasticsearch
In the following examples, we will explore various approaches to retrieving information using Elasticsearch, focusing on full-text search, semantic search, and a hybrid combination of both.
To accomplish this, we will demonstrate various search scenarios on a dataset generated to simulate e-commerce product information.
This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products.
Here is a sample of an object from the dataset:
{
  "product": "Samsung 49-inch Curved Gaming Monitor",
  "description": "is a curved gaming monitor with a high refresh rate and AMD FreeSync technology.",
  "category": "Monitors"
}
We will ingest the dataset from a JSON file into Elasticsearch, then perform a series of search operations to demonstrate the different search strategies.
🧰 Requirements
For this example, you will need:
- Python 3.11 or later
- The Elastic Python client
- Elastic 9.0 deployment or later on either a local, cloud, or serverless environment
We'll be using Elastic Cloud. You can use a free trial here to get started.
Setup Elasticsearch environment:
To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the Cloud Endpoint and Cloud API Key to identify our deployment. These may be found within Kibana by following the instructions here.
Import the required packages
We will import the following packages:
- Elasticsearch: a client library for Elasticsearch actions
- bulk: a function to perform Elasticsearch actions in bulk
- getpass: a module for receiving Elasticsearch credentials via text prompt
- json: a module for reading and writing JSON data
- pandas, display, Markdown: for data visualization and markdown formatting
📚 Instantiating the Elasticsearch Client
First we prompt the user for their Elastic Endpoint URL and Elastic API Key.
Then we create a client object that instantiates an instance of the Elasticsearch class.
Lastly, we verify that our client is connected to our Elasticsearch instance by calling client.ping().
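As a sketch, the connection flow might look like the following. The endpoint and key shown are placeholders; in the notebook they are gathered interactively with getpass:

```python
from getpass import getpass  # noqa: F401  (used interactively in the notebook)

# In the notebook these are gathered interactively, without echoing:
# ELASTIC_ENDPOINT = getpass("Elastic Cloud Endpoint: ")
# ELASTIC_API_KEY = getpass("Elastic API Key: ")
ELASTIC_ENDPOINT = "https://my-deployment.es.example.cloud:443"  # placeholder
ELASTIC_API_KEY = "<redacted>"  # placeholder

# Instantiate the client and verify the connection:
# from elasticsearch import Elasticsearch
# client = Elasticsearch(ELASTIC_ENDPOINT, api_key=ELASTIC_API_KEY)
# assert client.ping()
```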
🔐 NOTE:
getpass enables us to securely prompt the user for credentials without echoing them to the terminal or storing them in the notebook.
Prepare our embedding model workflow
Next we ensure our embedding models are available in Elasticsearch. We will use Elastic's built-in .multilingual-e5-small and ELSER v2 models to provide dense and sparse vectors, respectively. Using these models out of the box ensures they are up to date and ready for integration with Elasticsearch.
Other models may be uploaded and deployed using Eland or integrated using the inference endpoint API to connect to third-party models.
Create an inference pipeline
This function will create an ingest pipeline with inference processors to use ELSER (sparse_vector) and e5_multilingual_small (dense_vector) to infer against data that will be ingested in the pipeline. This allows us to automatically generate embeddings for the product descriptions when they are indexed into Elasticsearch.
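A minimal sketch of that pipeline definition, assuming a hypothetical pipeline id of ecommerce-pipeline and Elastic's built-in ELSER v2 and multilingual E5 small model ids (verify both against your deployment):

```python
# Ingest pipeline with two inference processors: ELSER (sparse) and E5 (dense).
# "ecommerce-pipeline" is a hypothetical id; the model ids are assumed to
# match the built-in deployments on your cluster.
PIPELINE_ID = "ecommerce-pipeline"

pipeline_body = {
    "description": "Embeds product descriptions with ELSER and E5",
    "processors": [
        {
            "inference": {
                "model_id": ".elser_model_2_linux-x86_64",
                "input_output": [
                    {
                        "input_field": "description",
                        "output_field": "elser_description_vector",
                    }
                ],
            }
        },
        {
            "inference": {
                "model_id": ".multilingual-e5-small",
                "input_output": [
                    {
                        "input_field": "description",
                        "output_field": "e5_description_vector",
                    }
                ],
            }
        },
    ],
}
# client.ingest.put_pipeline(id=PIPELINE_ID, **pipeline_body)
```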
Index documents
The ecommerce-search index we are creating will include fields to support dense and sparse vector storage and search.
We define the e5_description_vector and the elser_description_vector fields to store the inference pipeline results.
The field type of e5_description_vector is dense_vector. The .multilingual-e5-small model has an embedding size of 384, so the dimension of the vector (dims) is set to 384.
We also add an elser_description_vector field type to support the sparse_vector output from our .elser_model_2_linux-x86_64 model. No further configuration is needed for this field for our use case.
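A sketch of the corresponding mapping, under the assumptions above (384-dimensional E5 embeddings, sparse ELSER output):

```python
# Mapping for the ecommerce-search index: the original text fields plus one
# dense and one sparse vector field for the pipeline's inference output.
index_mappings = {
    "properties": {
        "product": {"type": "text"},
        "description": {"type": "text"},
        "category": {"type": "text"},
        "e5_description_vector": {"type": "dense_vector", "dims": 384},
        "elser_description_vector": {"type": "sparse_vector"},
    }
}
# client.indices.create(index="ecommerce-search", mappings=index_mappings)
```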
Attach Pipeline to Index
Let's connect our pipeline to the index. This updates the settings of our index to use the pipeline we previously defined as the default.
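Sketch of that settings update, assuming the hypothetical pipeline id ecommerce-pipeline:

```python
# Make the inference pipeline the default for all writes to the index.
# "ecommerce-pipeline" is a hypothetical pipeline id.
index_settings = {"index": {"default_pipeline": "ecommerce-pipeline"}}
# client.indices.put_settings(index="ecommerce-search", settings=index_settings)
```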
Load documents
We load the contents of products-ecommerce.json into the ecommerce-search index. We will use the bulk helper function to efficiently index our documents en masse.
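A sketch of the load step: a generator yields one bulk action per product, and the bulk helper streams them to the index (the default pipeline attached above then embeds each description during ingest).

```python
import json

def generate_actions(path="products-ecommerce.json", index="ecommerce-search"):
    """Yield one bulk index action per product in the JSON file."""
    with open(path) as f:
        for doc in json.load(f):
            yield {"_index": index, "_source": doc}

# from elasticsearch.helpers import bulk
# success_count, errors = bulk(client, generate_actions())
```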
Text Analysis
The classic way documents are ranked for relevance by Elasticsearch based on a text query uses the Lucene implementation of the BM25 model, a sparse model for lexical search. This method follows the traditional approach for text search, looking for exact term matches.
To make this search possible, Elasticsearch converts text field data into a searchable format by performing text analysis.
Text analysis is performed by an analyzer, a set of rules to govern the process of extracting relevant tokens for searching. An analyzer must have exactly one tokenizer. The tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words).
Standard Analyzer
In the example below we are using the default analyzer, the standard analyzer, which works well for most use cases as it provides English grammar based tokenization. Tokenization enables matching on individual terms, but each token is still matched literally.
Stop Analyzer
If you want to personalize your search experience you can choose a different built-in analyzer. For example, by updating the code to use the stop analyzer it will break the text into tokens at any non-letter character with support for removing stop words.
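The two analyzers can be compared with the _analyze API; a sketch of the request bodies, with the token lists these built-in analyzers produce for the sample text:

```python
text = "The Quick Brown Fox!"

# Default analyzer: grammar-based tokenization plus lowercasing.
standard_request = {"analyzer": "standard", "text": text}
# tokens: ["the", "quick", "brown", "fox"]

# Stop analyzer: splits on non-letters and removes English stop words.
stop_request = {"analyzer": "stop", "text": text}
# tokens: ["quick", "brown", "fox"]

# resp = client.indices.analyze(**standard_request)
```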
Custom Analyzer
When the built-in analyzers do not fulfill your needs, you can create a custom analyzer, which uses the appropriate combination of zero or more character filters, a tokenizer, and zero or more token filters.
In the below example that combines a tokenizer and token filters, the text will be lowercased by the lowercase filter before being processed by the synonyms token filter.
Note: you cannot pass a custom analyzer definition inline to analyze. Define the analyzer in your index settings, then reference it by name in the analyze call. For this reason we will create a temporary index to store the analyzer.
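A sketch of that temporary index. The index name, analyzer name, and synonym set (equating monitor, display, and screen) are all hypothetical:

```python
# Custom analyzer: standard tokenizer, then lowercase, then synonyms.
# Order matters: text is lowercased before the synonym filter runs.
temp_index_settings = {
    "analysis": {
        "filter": {
            "my_synonyms": {
                "type": "synonym",
                "synonyms": ["monitor, display, screen"],  # hypothetical set
            }
        },
        "analyzer": {
            "my_custom_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "my_synonyms"],
            }
        },
    }
}
# client.indices.create(index="temp-analyzer", settings=temp_index_settings)
# client.indices.analyze(index="temp-analyzer",
#                        analyzer="my_custom_analyzer",
#                        text="Curved Gaming Monitor")
```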
Text Analysis Results
The table below shows that both built-in and custom analyzers can be included with your search requests to improve result quality by reducing or refining the content being searched. Pay attention to your particular use case and the needs of your users.
Search
The remainder of this notebook will cover the following search types:
- Lexical Search
- Semantic Search
  - ELSER Semantic Search (Sparse Vector)
  - E5 Semantic Search (Dense Vector)
  - ELSER Semantic Search with semantic_text
  - E5 Semantic Search with semantic_text
- Hybrid Search
  - E5 + Lexical (linear combination)
  - E5 + Lexical (RRF)
  - ELSER + Lexical (linear combination)
  - ELSER + Lexical (RRF)
- ES|QL Search
  - Semantic Search ES|QL
  - ELSER ES|QL
  - E5 ES|QL
  - ELSER ES|QL with semantic_text
  - E5 ES|QL with semantic_text
Lexical Search
Our first search will be a straightforward BM25 text search within the description field. We are storing all of our results in a results_list for a final comparison at the end of the notebook. A convenience function to display the results is also defined.
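A sketch of the BM25 query, assuming the example query string used throughout this notebook:

```python
# Plain BM25 full-text search on the description field.
lexical_query = {
    "match": {"description": "Comfortable furniture for a large balcony"}
}
# resp = client.search(index="ecommerce-search", query=lexical_query, size=5)
# for hit in resp["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["product"])
```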
Semantic Search with Dense Vector
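One way to express this search is a kNN query that embeds the query text at search time with the E5 model. A sketch, assuming the built-in model id:

```python
# kNN search over the dense E5 vectors; the query text is embedded at
# search time via query_vector_builder using the deployed E5 model.
knn_query = {
    "field": "e5_description_vector",
    "k": 5,
    "num_candidates": 50,
    "query_vector_builder": {
        "text_embedding": {
            "model_id": ".multilingual-e5-small",  # assumed deployment id
            "model_text": "Comfortable furniture for a large balcony",
        }
    },
}
# resp = client.search(index="ecommerce-search", knn=knn_query)
```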
Semantic Search with Sparse Vector
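A sketch using the sparse_vector query; the inference_id is assumed to reference a deployed ELSER endpoint on your cluster:

```python
# Sparse semantic search: ELSER expands the query text into weighted
# tokens and matches them against the stored sparse vectors.
sparse_query = {
    "sparse_vector": {
        "field": "elser_description_vector",
        "inference_id": ".elser-2-elasticsearch",  # assumed endpoint id
        "query": "Comfortable furniture for a large balcony",
    }
}
# resp = client.search(index="ecommerce-search", query=sparse_query, size=5)
```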
Semantic Search with semantic_text Type (ELSER)
Semantic Search with semantic_text Type (e5)
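With a semantic_text mapping, both the ELSER and E5 variants reduce to the same simple semantic query; a sketch, with a hypothetical field name:

```python
# Semantic query against a semantic_text field; Elasticsearch handles
# embedding and matching. "description_semantic" is a hypothetical name.
semantic_query = {
    "semantic": {
        "field": "description_semantic",
        "query": "Comfortable furniture for a large balcony",
    }
}
# resp = client.search(index="ecommerce-search", query=semantic_query, size=5)
```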
Hybrid Search - BM25 + semantic_text Type
Hybrid Search - BM25 + Dense Vector linear combination
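A sketch of the linear combination: a BM25 query and a kNN clause in the same request, with illustrative boosts weighting the two score contributions:

```python
q = "Comfortable furniture for a large balcony"

# Lexical leg: BM25 match, down-weighted via boost (weights illustrative).
lexical_part = {"match": {"description": {"query": q, "boost": 0.3}}}

# Semantic leg: kNN over the dense E5 vectors, weighted more heavily.
knn_part = {
    "field": "e5_description_vector",
    "k": 5,
    "num_candidates": 50,
    "boost": 0.7,
    "query_vector_builder": {
        "text_embedding": {
            "model_id": ".multilingual-e5-small",  # assumed deployment id
            "model_text": q,
        }
    },
}
# Scores from the two legs are summed into one ranked result list:
# resp = client.search(index="ecommerce-search", query=lexical_part, knn=knn_part)
```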
Hybrid Search - BM25 + Dense Vector Reciprocal Rank Fusion (RRF)
Reciprocal rank fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.
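A sketch of the RRF request using the retriever syntax, combining a standard (BM25) retriever with a kNN retriever over the dense vectors:

```python
q = "Comfortable furniture for a large balcony"

rrf_body = {
    "retriever": {
        "rrf": {
            "retrievers": [
                # Lexical leg: BM25 over the description field.
                {"standard": {"query": {"match": {"description": q}}}},
                # Semantic leg: kNN over the dense E5 vectors.
                {
                    "knn": {
                        "field": "e5_description_vector",
                        "k": 5,
                        "num_candidates": 50,
                        "query_vector_builder": {
                            "text_embedding": {
                                "model_id": ".multilingual-e5-small",  # assumed
                                "model_text": q,
                            }
                        },
                    }
                },
            ],
            "rank_window_size": 50,
        }
    }
}
# resp = client.search(index="ecommerce-search", **rrf_body)
```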
Hybrid Search - BM25 + Sparse Vector linear combination
Hybrid Search - BM25 + Sparse Vector Reciprocal Rank Fusion (RRF)
ES|QL Search
Elastic offers its own query language called ES|QL, a piped query language for searching and analyzing data in Elasticsearch. Further information can be found in the official documentation.
Lexical Search with ES|QL
This demonstrates the lexical search capabilities of ES|QL using the MATCH function, which searches for matches to a query string within a specified field. In the example below, we search for documents matching the text "Comfortable furniture for a large balcony" in the description field.
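A sketch of the ES|QL query string and how it would be sent with the Python client:

```python
# ES|QL lexical search: MATCH on the text field, ranked by _score.
esql_query = """
FROM ecommerce-search METADATA _score
| WHERE MATCH(description, "Comfortable furniture for a large balcony")
| SORT _score DESC
| KEEP product, description, _score
| LIMIT 5
"""
# resp = client.esql.query(query=esql_query)
```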
Semantic Search with ES|QL
To perform a semantic search using ES|QL, run your query against a semantic_text field. This performs a similarity search based on the semantic meaning of the text, rather than the lexical (word-level) matching of the text type. As with the Python client, the ES|QL query is simple to write and understand.
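A sketch of the semantic variant; the only change from the lexical query is the target field, here a hypothetical semantic_text field named description_semantic:

```python
# ES|QL semantic search: MATCH against a semantic_text field runs a
# similarity search instead of term matching. Field name is hypothetical.
esql_semantic_query = """
FROM ecommerce-search METADATA _score
| WHERE MATCH(description_semantic, "Comfortable furniture for a large balcony")
| SORT _score DESC
| LIMIT 5
"""
# resp = client.esql.query(query=esql_semantic_query)
```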
Compiled Results
Here are the results of the previous searches. We can see that all of the searches return approximately the same products.
As can be seen in the results, the semantic search query provides more relevant results than the lexical search query. This is due to the semantic search query using the semantic_text field, which is based on the dense vector representation of the text, while the lexical search query uses the description field, which is based on the lexical representation of the text. Nuances and context are better captured by the semantic search query, making it more effective for finding relevant results.
Conclusion
It should be noted that while the semantic search query provides more relevant results, it is also more computationally expensive than the lexical search query. This is because the semantic search query requires the calculation of vector representations, which can be computationally intensive.
Ultimately, it is recommended to use the semantic_text type when implementing semantic search for a few key reasons:
- Query structure is simple and easy to understand.
- Implementing the semantic_text type requires minimal changes to the index mapping and query.
- Setting up an ingest pipeline and inference endpoint is unnecessary.
Using the sparse_vector and dense_vector types is more complex and requires additional setup, but can be useful in scenarios where semantic search needs to be customized beyond standard semantic_text search, such as a change in the similarity algorithm, use of different vectorization models, or any necessary preprocessing steps.
Hybrid search retains the power of both lexical and semantic search, allowing for a more flexible and effective search experience. With hybrid search, you can balance the trade-off between relevance and performance, making it a more practical choice for production environments. This should be considered the default approach for search.