Improve Retrieval By Embedding Metadata
🔍 Improve retrieval by embedding meaningful metadata 🏷️
Notebook by Stefano Fiorucci
In this notebook, I run some experiments on embedding meaningful metadata together with the text to improve Document retrieval.
Load data from Wikipedia
We are going to download the Wikipedia pages of some bands, using the wikipedia Python library.
These pages are converted into Haystack Documents.
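As a rough illustration of the conversion step, here is a minimal sketch that runs without network access. The `Doc` dataclass is a hypothetical stand-in for a Haystack Document (content plus a `meta` dictionary), and the choice of `title` and `url` as metadata keys is an assumption for illustration. The page is a fake object; a real one would come from a call such as `wikipedia.page(title=..., auto_suggest=False)`.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document: text plus metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def page_to_doc(page):
    """Convert a page-like object (with .title, .url, .content) into a Doc.

    The metadata keys used here are an assumption for illustration.
    """
    return Doc(content=page.content, meta={"title": page.title, "url": page.url})


# A fake page, so the sketch does not need to download anything.
fake_page = SimpleNamespace(
    title="Example Band",
    url="https://example.org/Example_Band",
    content="Example Band are a rock band formed by four friends.",
)

doc = page_to_doc(fake_page)
print(doc.meta["title"])
```

Keeping the page title in `meta` is what makes it available later, when deciding which metadata to embed.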
🔧 Setup the experiment
Utility functions to create Pipelines
The indexing Pipeline transforms the Documents and stores them (with their vectors) in a Document Store. The retrieval Pipeline takes a query as input and performs the vector search.
I build some utility functions to create different indexing and retrieval Pipelines.
In fact, I am interested in comparing the standard approach (where we embed only the text) with the metadata embedding strategy (where we embed the text plus meaningful metadata).
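To make the difference between the two approaches concrete, here is a minimal sketch of how the string passed to the embedding model could be built. It assumes the selected metadata values are simply prepended to the text, joined by a separator. Haystack's Document embedders expose a `meta_fields_to_embed` parameter for this purpose; the helper `build_text_to_embed` below is a hypothetical illustration of the idea, not the library's code, and the chunk is made up.

```python
def build_text_to_embed(content, meta, meta_fields_to_embed=(), separator="\n"):
    """Return the string that would be handed to the embedding model.

    Hypothetical helper: selected metadata values are placed before the
    text, joined by `separator`. Missing or None fields are skipped.
    """
    meta_values = [
        str(meta[name])
        for name in meta_fields_to_embed
        if meta.get(name) is not None
    ]
    return separator.join(meta_values + [content])


# A made-up chunk that never mentions the band by name.
chunk = "They released their debut album the following year."
meta = {"title": "Example Band", "url": "https://example.org/Example_Band"}

# Standard approach: only the text is embedded.
text_only = build_text_to_embed(chunk, meta)

# Metadata embedding: the title is embedded together with the text.
text_with_title = build_text_to_embed(chunk, meta, meta_fields_to_embed=["title"])

print(repr(text_only))
print(repr(text_with_title))
```

After splitting, many chunks refer to the band only as "they" or "the band", so the text alone carries no trace of which page it came from. Prepending the title puts that information back into the embedded string.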
Create the Pipelines
Let's define 2 Document Stores, to compare the different approaches.
Now, I create the 2 indexing pipelines and run them.
Create the 2 retrieval pipelines.
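The overall shape of the comparison (two stores, two indexing runs, two retrievers) can be sketched with the standard library alone. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the stores are plain lists rather than Document Stores, and the two chunks are made up. It shows the mechanism only and says nothing about the results obtained in this notebook.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index(chunks, embed_title):
    """Indexing step: embed each chunk, optionally prepending its title."""
    store = []
    for chunk in chunks:
        text = chunk["content"]
        if embed_title:
            text = f"{chunk['title']}\n{text}"
        store.append({**chunk, "embedding": toy_embed(text)})
    return store


def retrieve(store, query, top_k=1):
    """Retrieval step: embed the query and rank chunks by similarity."""
    query_embedding = toy_embed(query)
    ranked = sorted(
        store, key=lambda d: cosine(query_embedding, d["embedding"]), reverse=True
    )
    return ranked[:top_k]


# Two made-up chunks that do not name their band in the text.
chunks = [
    {"title": "Alpha Band", "content": "They recorded the first album in a studio."},
    {"title": "Beta Band",
     "content": "They recorded the first album while touring abroad in winter."},
]

plain_store = index(chunks, embed_title=False)  # text only
meta_store = index(chunks, embed_title=True)    # title + text

query = "Where did Beta Band record the first album?"
print(retrieve(plain_store, query)[0]["title"])
print(retrieve(meta_store, query)[0]["title"])
```

In this toy setup the text-only store cannot tell the two chunks apart by band, so the ranking is decided by incidental wording. Once the title is embedded, the band name in the query has something to match against.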
🧪 Run the experiment!
❌ the retrieved Documents seem irrelevant
✅ the first Document is relevant
❌ the retrieved Documents seem irrelevant
✅ some Documents are relevant
⚠️ Notes of caution
- This technique is not a silver bullet
- It works well when the embedded metadata are meaningful and distinctive
- I would say that the embedded metadata should be meaningful from the perspective of the embedding model. For example, I don't expect embedding numeric metadata to work well.