Chunk Large Documents via Ingest pipelines
This interactive notebook will:
- load the model "sentence-transformers__all-minilm-l6-v2" from Hugging Face into an Elasticsearch ML node
- create an index and ingest pipeline that will chunk large fields into smaller passages and vectorize them using the model
- perform a search and return docs with the most relevant passages
Prefer the semantic_text field type
Elasticsearch version 8.15 introduced the semantic_text field type, which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into it.
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial.
Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.
Install packages
To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.
First we need to install the elasticsearch Python client.
Initialize the Elasticsearch client
Now we can instantiate the Elasticsearch Python client, providing the Cloud ID and password for your deployment.
If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. Read more on how to connect to Elasticsearch locally.
Confirm that the client has connected with this test.
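A minimal sketch of the connection step, assuming the 8.x Python client. The helper names build_client_kwargs and connect are hypothetical, and the "elastic" username is an assumption (it is the default superuser on Elastic Cloud). The client import is kept inside connect so the helper can be read and checked without the package installed.

```python
def build_client_kwargs(password, cloud_id=None, host=None):
    """Return keyword arguments for elasticsearch.Elasticsearch.

    Pass cloud_id for an Elastic Cloud deployment, or host (for example
    "https://localhost:9200") for a local or self-managed cluster.
    """
    if (cloud_id is None) == (host is None):
        raise ValueError("Provide exactly one of cloud_id or host")
    auth = {"basic_auth": ("elastic", password)}  # assumed username
    if cloud_id is not None:
        return {"cloud_id": cloud_id, **auth}
    return {"hosts": [host], **auth}


def connect(password, cloud_id=None, host=None):
    """Create the client and print cluster info as a connection test."""
    # Imported lazily: requires `pip install elasticsearch`.
    from elasticsearch import Elasticsearch

    client = Elasticsearch(**build_client_kwargs(password, cloud_id, host))
    print(client.info())
    return client
```

In a notebook you would typically read the Cloud ID and password with getpass rather than hard-coding them, then call connect(password, cloud_id=cloud_id).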
Load Model from Hugging Face
The first thing you need is a model to create text embeddings from the chunks. You can use any text embedding model you like, but this example runs end to end on the minilm-l6-v2 model. With an Elastic Cloud cluster created, or another Elasticsearch cluster ready, we can upload the text embedding model using the eland library.
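The upload might look roughly like the sketch below, assuming eland is installed with its PyTorch extras and that client is an already connected Elasticsearch client. The function names upload_model and elasticsearch_model_id, and the "models" download directory, are hypothetical. elasticsearch_model_id only illustrates how the Hugging Face id maps to the model name used in this notebook; the upload itself asks eland for the id.

```python
from pathlib import Path

HF_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"


def elasticsearch_model_id(hf_model_id):
    """Illustrative mapping: slashes become double underscores, lowercased."""
    return hf_model_id.replace("/", "__").lower()


def upload_model(client, hf_model_id=HF_MODEL_ID, tmp_path="models"):
    """Download the model from Hugging Face, import it into Elasticsearch, start it."""
    # Imported lazily: requires `pip install "eland[pytorch]"`.
    from eland.ml.pytorch import PyTorchModel
    from eland.ml.pytorch.transformers import TransformerModel

    tm = TransformerModel(model_id=hf_model_id, task_type="text_embedding")
    es_model_id = tm.elasticsearch_model_id()

    # Save the traced model, its config, and its vocabulary to a local directory.
    Path(tmp_path).mkdir(parents=True, exist_ok=True)
    model_path, config, vocab_path = tm.save(tmp_path)

    # Upload the artifacts to the cluster and start a deployment on the ML node.
    ptm = PyTorchModel(client, es_model_id)
    ptm.import_model(
        model_path=model_path, config_path=None, vocab_path=vocab_path, config=config
    )
    ptm.start()
    return es_model_id
```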
Chunk and Infer in pipeline
The next step is to define an ingest pipeline that breaks up the text field into chunks of text stored in the passages field. This pipeline has two processors.
The first, a script processor, uses a regular expression to break up the text field into an array of sentences stored in the passages field. To understand how it tries to split on sentence boundaries, avoid splitting on Mr., Mrs., or Ms., and keep the punctuation with each sentence, read up on advanced regular expression features such as negative and positive lookbehind. The script also tries to concatenate the sentence chunks back together as long as the total string length stays under the limit passed to the script as a parameter.
The second, a foreach processor, runs the text embedding model on each chunk via an inference processor:
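To make the chunking logic concrete, here is a Python sketch of what the script processor is described as doing, followed by a possible shape for the foreach processor. This is an illustration under assumptions, not the pipeline's actual Painless script: the names chunk_text and foreach_inference_processor, the default limit of 400 characters, and the vector target field are all hypothetical. "text_field" is the input field name that Elasticsearch text embedding models expect by default.

```python
import re

# Split on a space that follows ., ! or ?, but not after Mr., Ms. or Mrs.
# Python lookbehinds must be fixed width, so each title gets its own assertion.
SENTENCE_BOUNDARY = re.compile(r"(?<!Mr\.)(?<!Ms\.)(?<!Mrs\.)(?<=[.!?]) ")


def chunk_text(text, model_limit=400):
    """Split text into sentences, then greedily merge neighbours under model_limit."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text) if s]
    passages = []
    i = 0
    while i < len(sentences):
        passage = sentences[i]
        i += 1
        # Keep appending sentences while the combined length stays under the limit.
        while i < len(sentences) and len(passage) + len(sentences[i]) < model_limit:
            passage += " " + sentences[i]
            i += 1
        passages.append({"text": passage})
    return passages


def foreach_inference_processor(model_id):
    """Build a foreach processor that embeds every entry of the passages array."""
    return {
        "foreach": {
            "field": "passages",
            "processor": {
                "inference": {
                    "model_id": model_id,
                    # _ingest._value is the array element currently being processed.
                    "field_map": {"_ingest._value.text": "text_field"},
                    "target_field": "_ingest._value.vector",
                }
            },
        }
    }
```

For example, chunk_text("Mr. Smith went home. He slept. Then he woke up!", model_limit=30) keeps "Mr. Smith" intact and merges the first two sentences into one passage, because their combined length stays under the limit.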
ObjectApiResponse({'acknowledged': True})
Setup Index
The next step is to prepare the mappings to handle the array of sentences and vector objects that will be created during the ingest pipeline. For this particular text embedding model, the vectors have 384 dimensions, and dot_product similarity will be used for nearest neighbor calculations:
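One way to express such mappings with the Python client is sketched below. It assumes the passages are stored as nested objects and that the inference processor writes each embedding to passages.vector.predicted_value; those field paths and the helper names build_mappings and create_index are assumptions, while the index name comes from the response shown next.

```python
INDEX_NAME = "chunk_passages_example"


def build_mappings(dims=384, similarity="dot_product"):
    """Mappings for an array of passages, each carrying its own dense vector."""
    return {
        "dynamic": True,
        "properties": {
            "passages": {
                # nested keeps each passage's text and vector together as one unit
                "type": "nested",
                "properties": {
                    "vector": {
                        "properties": {
                            "predicted_value": {
                                "type": "dense_vector",
                                "index": True,
                                "dims": dims,
                                "similarity": similarity,
                            }
                        }
                    }
                },
            }
        },
    }


def create_index(client, index_name=INDEX_NAME):
    """Create the index; client is an already connected Elasticsearch client."""
    return client.indices.create(index=index_name, mappings=build_mappings())
```

Note that dot_product similarity expects unit-length vectors, so this choice only suits models whose embeddings are normalized.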
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'chunk_passages_example'})
Add some Documents
Now we can add documents with large amounts of text in body_content and have them chunked automatically, with each chunk embedded into a vector by the model:
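A sketch of the ingestion step, assuming each document is a dict with a body_content field and a title field, and that the pipeline was registered under the id chunk_text_to_passages. The pipeline id, the title field name, and the helper names are hypothetical. Documents are indexed one at a time here for clarity; a bulk helper would be the usual choice for larger datasets.

```python
INDEX_NAME = "chunk_passages_example"
PIPELINE_ID = "chunk_text_to_passages"  # hypothetical pipeline id


def index_requests(docs, index_name=INDEX_NAME, pipeline_id=PIPELINE_ID):
    """Turn documents into keyword arguments for client.index, one per document."""
    return [
        {"index": index_name, "id": str(doc_id), "document": doc, "pipeline": pipeline_id}
        for doc_id, doc in enumerate(docs)
    ]


def ingest(client, docs):
    """Send each document through the ingest pipeline, then refresh the index."""
    for request in index_requests(docs):
        client.index(**request)
    # Make the new documents searchable right away.
    client.indices.refresh(index=INDEX_NAME)
```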
Aside: Pretty printing Elasticsearch responses
Your API calls will return hard-to-read nested JSON.
We'll create a little function called pretty_response to return nice, human-readable outputs from our examples.
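A possible version of this helper is sketched below. It assumes each hit carries a title field in _source and an inner_hits section named passages whose entries expose the passage text under fields; the exact shape depends on how the search request is written, so treat those paths as assumptions. The function returns the formatted string so it can be printed or inspected.

```python
def pretty_response(response):
    """Format search hits as ID, title, best passage text, and score."""
    hits = response["hits"]["hits"]
    if not hits:
        return "Your search returned no results."
    lines = []
    for hit in hits:
        # Walk down to the inner hits, tolerating hits that have none.
        inner = hit.get("inner_hits", {}).get("passages", {})
        passages = inner.get("hits", {}).get("hits", [])
        passage_text = "\n".join(
            p["fields"]["passages"][0]["text"][0] for p in passages
        )
        lines += [
            f"ID: {hit['_id']}",
            f"Doc Title: {hit['_source'].get('title', '')}",  # assumed field name
            f"Passage Text: {passage_text}",
            f"Score: {hit['_score']}",
            "---",
        ]
    return "\n".join(lines)
```

Calling print(pretty_response(response)) then shows one block per matching document.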
Making queries
To search the data and see which chunk matched the query best, use inner_hits with the knn clause. The hits in the response then include only the best-matching chunk of each document.
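Such a query could be sketched as follows, assuming the embeddings live in a nested passages.vector.predicted_value field and that the deployed model lets Elasticsearch embed the query text through query_vector_builder. The field path, the values of k and num_candidates, the helper names, and the sample question are assumptions for illustration.

```python
INDEX_NAME = "chunk_passages_example"
MODEL_ID = "sentence-transformers__all-minilm-l6-v2"


def build_knn(query_text, model_id=MODEL_ID, k=5, num_candidates=100):
    """kNN clause that embeds the query text and returns the best passage per doc."""
    return {
        "field": "passages.vector.predicted_value",  # assumed vector field path
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {"model_id": model_id, "model_text": query_text}
        },
        # size 1 keeps only the single best-matching passage of each document
        "inner_hits": {"size": 1, "_source": False, "fields": ["passages.text"]},
    }


def search(client, query_text):
    """Run the kNN search; client is an already connected Elasticsearch client."""
    return client.search(index=INDEX_NAME, knn=build_knn(query_text))
```

For example, search(client, "Can I work from home?") would return documents ranked by their closest passage, with that passage available under inner_hits.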
Below you will see the response which returns the best document and the most relevant passage.
ID: 0
Doc Title: Work From Home Policy
Passage Text: Effective: March 2020 Purpose The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond. Scope This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities.
Score: 0.85496104
---
ID: 7
Doc Title: Intellectual Property Policy
Passage Text: This policy aims to encourage creativity and innovation while ensuring that the interests of both the company and its employees are protected. Scope This policy applies to all employees, including full-time, part-time, temporary, and contract employees. Definitions a.
Score: 0.7664343
---
ID: 4
Doc Title: Company Vacation Policy
Passage Text: Purpose The purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes.
Score: 0.725452
---