Result diversification with Elasticsearch

This notebook demonstrates:

  1. Loading a fashion dataset
  2. Indexing it in Elasticsearch for image search
  3. Searching for items with a broad search term
  4. Applying result diversification to the results with the MMR algorithm

Check out our blog post on this topic to learn more.

1. Setup and Dependencies

[1]
[2]

2. Load Configuration

Create a configuration file named elastic_config.env in the following format to authenticate with the JINA API and the Elastic cluster.

ELASTIC_API_KEY=<ELASTIC_KEY>
ELASTIC_HOST=<HOST_URL>
JINA_API_KEY=<JINA_KEY>
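
The loading cell itself is not reproduced here. As a minimal sketch, assuming the file holds only plain KEY=VALUE lines, the values could be read with the standard library alone. `load_env_file` and `missing_keys` are hypothetical helpers for illustration, not the notebook's own code.

```python
from pathlib import Path

# The three keys named in the configuration format above.
REQUIRED_KEYS = ("ELASTIC_API_KEY", "ELASTIC_HOST", "JINA_API_KEY")


def load_env_file(path):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    config = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without a '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config


def missing_keys(config):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Checking `missing_keys(load_env_file("elastic_config.env"))` before any API call surfaces a misnamed key early instead of as an authentication error later.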

[3]
Configuration loaded successfully

3. Load Dataset and Extract ID & Image URLs

[4]
Path to dataset files: /Users/peter/.cache/kagglehub/datasets/paramaggarwal/fashion-product-images-dataset/versions/1
Loaded 44446 total products

Filtered to 2694 bottomwear products
[5]
Extracted 2693 products with valid IDs and image URLs

Limited to 1000 items for demo

Sample items (alphabetically sorted):
  - Femella Women Off White Shorts (Shorts, Off White)
  - Nike Women Strong Poly Black Capri (Capris, Black)
  - Flying Machine Men Blue Jeans (Jeans, Blue)
  - Urban Yoga Men Black Shorts (Shorts, Black)
  - Doodle Girls Lace Bow LT.Pink Leggings (Leggings, Pink)
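
The cells that produced this output are not shown. The sketch below illustrates one way the filter and join could work, using only the standard library. The column names (`id`, `subCategory`, `articleType`, `baseColour`, `productDisplayName`, `filename`, `link`) are assumptions about the dataset's CSV layout, and `select_products` is a hypothetical helper.

```python
import csv


def select_products(styles_csv, images_csv, sub_category="Bottomwear", limit=1000):
    """Join product rows with image URLs and keep one sub-category.

    Both arguments are open text files (or any iterable of CSV lines).
    Column names are assumptions about the dataset layout.
    """
    # Map "<id>.jpg" -> image URL.
    links = {row["filename"]: row["link"] for row in csv.DictReader(images_csv)}

    products = []
    for row in csv.DictReader(styles_csv):
        if row.get("subCategory") != sub_category:
            continue
        url = links.get(f"{row['id']}.jpg")
        # Keep only rows with a usable ID and an image URL.
        if not row["id"].isdigit() or not url:
            continue
        products.append({
            "id": row["id"],
            "name": row["productDisplayName"],
            "article_type": row["articleType"],
            "color": row["baseColour"],
            "image_url": url,
        })
        if len(products) == limit:  # cap the demo size
            break
    return products
```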

4. Create Image Embeddings with JINA API

[6]
Getting embeddings...
Getting embeddings: 100%|██████████| 1000/1000 [05:25<00:00,  3.08it/s]
Retrieved 1000 embeddings
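
The embedding cell is not reproduced here. As a rough sketch of what a request for image or text embeddings might look like: the endpoint URL, the model name, and the payload and response shapes below are all assumptions and should be checked against the JINA API documentation before use. The function names are hypothetical.

```python
import json
import urllib.request

JINA_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint
JINA_MODEL = "jina-clip-v2"                     # assumed multimodal model name


def build_payload(image_urls=(), texts=()):
    """Build the JSON body; the input shape is an assumption."""
    inputs = [{"image": url} for url in image_urls] + [{"text": t} for t in texts]
    return {"model": JINA_MODEL, "input": inputs}


def parse_embeddings(response_json):
    """Return embeddings ordered by their 'index' field (assumed response shape)."""
    rows = sorted(response_json["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def embed(api_key, image_urls=(), texts=(), timeout=30):
    """Send one embedding request and return a list of vectors."""
    request = urllib.request.Request(
        JINA_URL,
        data=json.dumps(build_payload(image_urls, texts)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```

Because images and text are embedded into the same vector space by a multimodal model, the same helper could serve both the indexing step and the later text query.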

[7]
Original products: 1000
After filtering similar items: 758
Removed 242 similar items
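
The filtering cell is not shown, so the exact rule is unknown. One plausible approach, sketched here as an assumption, is a greedy pass that keeps an item only if its cosine similarity to every already-kept item stays below a threshold. The threshold value and the `embedding` key are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(items, threshold=0.95):
    """Greedy filter: keep an item only if it is not too similar to any kept item.

    Each item is a dict with an "embedding" list (assumed layout).
    """
    kept = []
    for item in items:
        if all(cosine(item["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(item)
    return kept
```

This is quadratic in the number of items, which is acceptable for a demo of about 1000 products.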

5. Setup Elasticsearch Index

[8]
Deleted existing index 'fashion_images'
Created index 'fashion_images'
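
The index-creation cell is not reproduced. A minimal sketch of what it could do with the 8.x Python client follows. The field names, the vector dimension, and the cosine similarity setting are assumptions; only the index name comes from the output above.

```python
def recreate_index(es, index_name="fashion_images", dims=1024):
    """Drop and recreate the index with a dense_vector field.

    `es` is expected to be an elasticsearch.Elasticsearch client (8.x API).
    Field names and `dims` are illustrative assumptions.
    """
    mappings = {
        "properties": {
            "name": {"type": "text"},
            "article_type": {"type": "keyword"},
            "color": {"type": "keyword"},
            "image_url": {"type": "keyword"},
            "image_vector": {
                "type": "dense_vector",
                "dims": dims,            # must match the embedding model's output size
                "index": True,           # enable kNN search on this field
                "similarity": "cosine",
            },
        }
    }
    # ignore_unavailable avoids an error when the index does not exist yet.
    es.indices.delete(index=index_name, ignore_unavailable=True)
    es.indices.create(index=index_name, mappings=mappings)
    return mappings
```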

6. Index Documents with Image Vectors

[9]
start
Indexing images: 100%|██████████| 758/758 [00:26<00:00, 28.72it/s]
Successfully indexed 758 documents
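
The indexing cell is not shown. As a sketch, documents could be turned into bulk actions like this; the source field names are assumptions, and `build_actions` is a hypothetical helper.

```python
def build_actions(items, index_name="fashion_images"):
    """Yield one bulk action per product (field names are assumptions)."""
    for item in items:
        yield {
            "_index": index_name,
            "_id": item["id"],
            "_source": {
                "name": item["name"],
                "image_url": item["image_url"],
                "image_vector": item["embedding"],
            },
        }


# With the official client installed, the actions could be sent with:
#   from elasticsearch import helpers
#   success_count, errors = helpers.bulk(es, build_actions(items))
```

Using the product ID as `_id` makes re-running the cell overwrite documents rather than duplicate them.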

7. Query Images with Text Search

[13]
Creating text embedding for: 'pants'

Searching for items similar to: 'pants'
Found 150 similar images
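
The query cell is not reproduced. A text-to-image search of this kind is typically a kNN query with the text embedding as the query vector. The sketch below assumes the 8.x Python client and a vector field named `image_vector`; the `num_candidates` choice is illustrative.

```python
def knn_search(es, query_vector, k=150, index_name="fashion_images"):
    """Run a kNN search and return flattened hits.

    `es` is expected to be an elasticsearch.Elasticsearch client (8.x API).
    The field name "image_vector" is an assumption.
    """
    response = es.search(
        index=index_name,
        knn={
            "field": "image_vector",
            "query_vector": query_vector,
            "k": k,
            # Candidates examined per shard; must be at least k.
            "num_candidates": max(k * 2, 100),
        },
        size=k,
        # Return the stored vectors too, so results can be reranked client-side.
        source=["name", "image_url", "image_vector"],
    )
    return [
        {**hit["_source"], "id": hit["_id"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
```

Fetching far more results than will be displayed (150 here) gives a later reranking step a wide enough pool to pick varied items from.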

8. Display Search Results

Showing results for text search: "pants"

[15]

9. Reranking with Maximum Marginal Relevance (MMR)

MMR is a diversity-promoting algorithm that balances:

  - Relevance: how well items match the query
  - Diversity: how different items are from each other

The algorithm iteratively selects items that are relevant to the query but different from already selected items.
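
The selection loop described above can be sketched in a few lines. This is a generic MMR formulation using cosine similarity and a trade-off weight `lambda_param`, not necessarily the notebook's exact implementation; the function and parameter names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, candidates, top_k=10, lambda_param=0.5):
    """Rerank with Maximum Marginal Relevance.

    candidates: list of (item_id, vector) pairs.
    lambda_param: 1.0 means pure relevance, 0.0 means pure diversity.
    Returns item IDs in selection order.
    """
    remaining = list(candidates)
    relevance = {cid: cosine(query_vec, vec) for cid, vec in remaining}
    selected = []
    while remaining and len(selected) < top_k:
        best, best_score = None, -math.inf
        for cid, vec in remaining:
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max((cosine(vec, svec) for _, svec in selected), default=0.0)
            score = lambda_param * relevance[cid] - (1 - lambda_param) * redundancy
            if score > best_score:
                best, best_score = (cid, vec), score
        selected.append(best)
        remaining.remove(best)
    return [cid for cid, _ in selected]
```

With a low `lambda_param`, a near-duplicate of an already selected item is penalized heavily, so a less relevant but visually different item can outrank it. Raising `lambda_param` toward 1.0 moves the ordering back toward plain relevance ranking.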

[16]