Result diversification with Elasticsearch
This notebook demonstrates how to:
- Load a fashion dataset
- Index the items in Elasticsearch with image vectors for image search
- Search for items with a broad search term
- Apply result diversification to the results with the MMR algorithm
Check out our blog post on this topic to learn more.
1. Setup and Dependencies
2. Load Configuration
Create a configuration file named elastic_config.env in the following format to authenticate with the JINA API and the Elastic cluster:
ELASTIC_API_KEY=<ELASTIC_KEY>
ELASTIC_HOST=<HOST_URL>
JINA_API_KEY=<JINA_KEY>
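The file above can be read in several ways. The sketch below uses only the standard library and is not necessarily how the notebook loads it (a package such as python-dotenv would also work). The helper names `load_env_file` and `check_config` are hypothetical.

```python
from pathlib import Path

# The three settings named in the configuration file format above.
REQUIRED_KEYS = ("ELASTIC_API_KEY", "ELASTIC_HOST", "JINA_API_KEY")


def load_env_file(path):
    """Parse KEY=VALUE lines from an env file into a dict."""
    config = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config


def check_config(config):
    """Raise if any of the expected settings is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"Missing configuration values: {', '.join(missing)}")
    return config
```

Typical use would be `config = check_config(load_env_file("elastic_config.env"))`, which fails early if a key was left out.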
Configuration loaded successfully
3. Load Dataset and Extract ID & Image URLs
Path to dataset files: /Users/peter/.cache/kagglehub/datasets/paramaggarwal/fashion-product-images-dataset/versions/1
Loaded 44446 total products
Filtered to 2694 bottomwear products
Extracted 2693 products with valid IDs and image URLs
Limited to 1000 items for demo
Sample items (alphabetically sorted):
- Femella Women Off White Shorts (Shorts, Off White)
- Nike Women Strong Poly Black Capri (Capris, Black)
- Flying Machine Men Blue Jeans (Jeans, Blue)
- Urban Yoga Men Black Shorts (Shorts, Black)
- Doodle Girls Lace Bow LT.Pink Leggings (Leggings, Pink)
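The filtering and extraction steps in this section can be sketched roughly as follows. This is an illustration, not the notebook's code: it assumes the product rows are dicts with the columns `id`, `subCategory`, `articleType`, `baseColour`, and `productDisplayName`, and that image URLs are available as a mapping from file name (for example `"1234.jpg"`) to URL. The function name and the output field names are hypothetical.

```python
def select_products(style_rows, image_links, sub_category="Bottomwear", limit=1000):
    """Keep rows in one sub-category that have both an ID and an image URL."""
    selected = []
    for row in style_rows:
        if row.get("subCategory") != sub_category:
            continue
        product_id = (row.get("id") or "").strip()
        # Assumed convention: the image file is named after the product ID.
        image_url = image_links.get(f"{product_id}.jpg")
        if not product_id or not image_url:
            continue  # drop rows without a usable ID or image URL
        selected.append(
            {
                "id": product_id,
                "name": row.get("productDisplayName", ""),
                "article_type": row.get("articleType", ""),
                "colour": row.get("baseColour", ""),
                "image_url": image_url,
            }
        )
    # Cap the result so the demo stays small.
    return selected[:limit]
```

Rows could come from `csv.DictReader` or a DataFrame converted to records. The `limit` default of 1000 mirrors the demo limit reported in the output above.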
4. Create Image Embeddings with JINA API
Getting embeddings...
Getting embeddings: 100%|██████████| 1000/1000 [05:25<00:00, 3.08it/s]
Retrieved 1000 embeddings
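As a rough illustration of this step, the sketch below builds and sends one embedding request per batch of image URLs using only the standard library. Several details are assumptions rather than facts from this notebook: the endpoint URL, the model name `jina-clip-v2`, the request and response shapes, and the helper names. Check them against the JINA API documentation before relying on this.

```python
import json
import urllib.request

# Assumed endpoint; verify against the JINA API documentation.
JINA_EMBEDDINGS_URL = "https://api.jina.ai/v1/embeddings"


def build_image_embedding_request(api_key, image_urls, model="jina-clip-v2"):
    """Build (but do not send) a POST request asking for image embeddings."""
    # Assumed payload shape: one {"image": <url>} entry per input image.
    payload = {"model": model, "input": [{"image": url} for url in image_urls]}
    return urllib.request.Request(
        JINA_EMBEDDINGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def fetch_embeddings(request, timeout=60):
    """Send the request and return the embeddings in input order."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: {"data": [{"index": 0, "embedding": [...]}, ...]}
    ordered = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```

Separating request construction from sending keeps the payload easy to inspect and makes retries or rate limiting simpler to add around `fetch_embeddings`.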
Original products: 1000
After filtering similar items: 758
Removed 242 similar items
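The output above reports that near-identical items were removed before indexing. The notebook's exact criterion is not shown here, so the following is only one plausible approach: a greedy pass that keeps an item when its cosine similarity to every already-kept item is below a threshold. The threshold value of 0.95 and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(embeddings, threshold=0.95):
    """Return the indices of items to keep, in their original order."""
    kept = []
    for index, vector in enumerate(embeddings):
        # Keep the item only if it is not too similar to anything kept so far.
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(index)
    return kept
```

This pass is quadratic in the number of items, which is acceptable for about a thousand vectors. Larger collections would call for a vectorized or approximate approach.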
5. Setup Elasticsearch Index
Deleted existing index 'fashion_images'
Created index 'fashion_images'
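An index for this kind of image search needs a `dense_vector` field to hold the embeddings. The sketch below shows one possible mapping and a delete-then-create helper. The field names, the choice of cosine similarity, and the helper names are assumptions, and `dims` must match whatever embedding model is actually used. `client` is expected to be an `Elasticsearch` instance from the official Python client.

```python
def build_index_body(dims, similarity="cosine"):
    """Return an index definition with product metadata and an image vector field."""
    return {
        "mappings": {
            "properties": {
                "product_id": {"type": "keyword"},
                "name": {"type": "text"},
                "article_type": {"type": "keyword"},
                "colour": {"type": "keyword"},
                "image_url": {"type": "keyword"},
                # Indexed dense vector so the field can be used for kNN search.
                "image_vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                },
            }
        }
    }


def recreate_index(client, index_name, body):
    """Delete the index if it exists, then create it with the given mappings."""
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
        print(f"Deleted existing index '{index_name}'")
    client.indices.create(index=index_name, mappings=body["mappings"])
    print(f"Created index '{index_name}'")
```

Dropping and recreating the index is convenient in a demo because reruns start from a clean state, but it destroys existing data and would not be appropriate for a production index.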
6. Index Documents with Image Vectors
start
Indexing images: 100%|██████████| 758/758 [00:26<00:00, 28.72it/s]
Successfully indexed 758 documents
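The notebook's indexing code is not shown here. As a sketch of one common pattern, the generator below pairs each product with its embedding and yields actions in the format accepted by the `elasticsearch.helpers.bulk` helper of the official Python client. The document field names and the function name are assumptions.

```python
def generate_actions(index_name, products, embeddings):
    """Yield one bulk index action per (product, embedding) pair."""
    for product, vector in zip(products, embeddings):
        yield {
            "_index": index_name,
            # Using the product ID as the document ID makes reruns overwrite
            # existing documents instead of creating duplicates.
            "_id": product["id"],
            "_source": {
                "product_id": product["id"],
                "name": product["name"],
                "article_type": product["article_type"],
                "colour": product["colour"],
                "image_url": product["image_url"],
                "image_vector": vector,
            },
        }
```

With a connected client, `helpers.bulk(client, generate_actions("fashion_images", products, embeddings))` would send these actions in batches. Indexing documents one at a time, as the progress bar above suggests the notebook does, also works but needs one request per document.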
7. Query Images with Text Search
Creating text embedding for: 'pants'
Searching for items similar to: 'pants'
Found 150 similar images
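Once the query text has been embedded into the same vector space as the images, the search itself can be expressed as a kNN query. The sketch below builds the search parameters. The field name `image_vector`, the `num_candidates` value, and the function name are assumptions; `k=150` mirrors the number of hits reported above.

```python
def build_knn_search(query_vector, k=150, num_candidates=1000, field="image_vector"):
    """Return search parameters for an approximate kNN query on a dense_vector field."""
    if num_candidates < k:
        # Elasticsearch requires at least k candidates per shard.
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        # Ask for as many hits as nearest neighbours were requested.
        "size": k,
    }
```

With the official Python client this could be passed as `client.search(index="fashion_images", knn=params["knn"], size=params["size"])`. Retrieving a fairly large candidate set is deliberate: the diversification step needs more relevant items than it will finally display.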
8. Display Search Results
Showing results for text search: "pants"
9. Reranking with Maximum Marginal Relevance (MMR)
MMR is a diversity-promoting algorithm that balances:
- Relevance: How well items match the query
- Diversity: How different items are from each other
The algorithm iteratively selects items that are relevant to the query but different from already selected items.
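The iterative selection described above can be sketched in plain Python. This is a minimal illustration of the standard MMR formulation, not necessarily the notebook's implementation: at each step it picks the candidate that maximizes `lambda_param * relevance - (1 - lambda_param) * redundancy`, where redundancy is the highest similarity to any item already selected. Cosine similarity, the parameter names, and the default values are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_rerank(query_vector, doc_vectors, top_n=10, lambda_param=0.5):
    """Return indices of up to top_n documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine_similarity(query_vector, v) for v in doc_vectors]
    candidates = list(range(len(doc_vectors)))
    selected = []
    while candidates and len(selected) < top_n:
        best_index, best_score = None, -math.inf
        for i in candidates:
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max(
                (cosine_similarity(doc_vectors[i], doc_vectors[j]) for j in selected),
                default=0.0,
            )
            score = lambda_param * relevance[i] - (1 - lambda_param) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected
```

Setting `lambda_param` to 1.0 reduces the procedure to a plain relevance ranking, while lower values push the selection toward items that differ from those already chosen. For a query such as "pants", that means the top results are less likely to be several near-identical pairs of jeans.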