Improve Retrieval By Embedding Metadata

🔍 Improve retrieval by embedding meaningful metadata 🏷️

Notebook by Stefano Fiorucci

In this notebook, I experiment with embedding meaningful metadata together with the text to improve Document retrieval.

[ ]

Load data from Wikipedia

We are going to download the Wikipedia pages of some bands, using the Python library wikipedia.

These pages are converted into Haystack Documents.
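As a rough sketch of this step, assuming the `wikipedia.page` call of the wikipedia package and the `Document` class of Haystack 2.x: the band names, the helper names and the choice of metadata fields (`title`, `url`) are illustrative assumptions, not necessarily what the original cell uses.

```python
from types import SimpleNamespace

# Illustrative band names (an assumption; the original list is not shown here).
BAND_TITLES = ["The Beatles", "The Cure"]


def page_to_fields(page):
    """Map a Wikipedia page object to the content/meta fields of a Document."""
    return {
        "content": page.content,
        "meta": {"title": page.title, "url": page.url},
    }


def load_documents(titles):
    """Download each Wikipedia page and wrap it in a Haystack Document."""
    # Imported here so that page_to_fields can be tried offline, without
    # network access or the third-party packages installed.
    import wikipedia
    from haystack import Document

    documents = []
    for title in titles:
        # auto_suggest=False asks for the exact title instead of a suggestion.
        page = wikipedia.page(title=title, auto_suggest=False)
        documents.append(Document(**page_to_fields(page)))
    return documents


# Offline check of the field mapping, using a stand-in page object.
fake_page = SimpleNamespace(
    title="The Cure",
    content="(page text)",
    url="https://en.wikipedia.org/wiki/The_Cure",
)
print(page_to_fields(fake_page)["meta"]["title"])
```

Calling `load_documents(BAND_TITLES)` would then return one Document per page, with the page title stored in `meta`, ready to be embedded later.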

[ ]

🔧 Setup the experiment

Utility functions to create Pipelines

The indexing Pipeline transforms the Documents and stores them (with vectors) in a Document Store. The retrieval Pipeline takes a query as input and performs the vector search.

I build some utility functions to create different indexing and retrieval Pipelines.
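A minimal sketch of what such utility functions could look like with Haystack 2.x components. The embedding model, the chunking parameters, the component names and `top_k` are my assumptions for illustration, so treat the values as placeholders.

```python
# Assumed embedding model; any Sentence Transformers model could be used.
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def create_indexing_pipeline(document_store, metadata_fields_to_embed=None):
    """Clean, split, embed and write Documents into `document_store`.

    If `metadata_fields_to_embed` is given (e.g. ["title"]), the values of
    those meta fields are embedded together with the text of each chunk.
    """
    # Haystack imports live inside the functions so that this sketch can be
    # loaded even where Haystack is not installed.
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersDocumentEmbedder
    from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.types import DuplicatePolicy

    pipeline = Pipeline()
    pipeline.add_component("cleaner", DocumentCleaner())
    # Chunking by 50 words is an arbitrary choice for this sketch.
    pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=50))
    pipeline.add_component(
        "embedder",
        SentenceTransformersDocumentEmbedder(
            model=EMBEDDING_MODEL, meta_fields_to_embed=metadata_fields_to_embed
        ),
    )
    pipeline.add_component(
        "writer",
        DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE),
    )
    pipeline.connect("cleaner.documents", "splitter.documents")
    pipeline.connect("splitter.documents", "embedder.documents")
    pipeline.connect("embedder.documents", "writer.documents")
    return pipeline


def create_retrieval_pipeline(document_store, top_k=3):
    """Embed the query and run a vector search against `document_store`."""
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersTextEmbedder
    from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

    pipeline = Pipeline()
    pipeline.add_component(
        "text_embedder", SentenceTransformersTextEmbedder(model=EMBEDDING_MODEL)
    )
    pipeline.add_component(
        "retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=top_k)
    )
    pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
    return pipeline
```

The only difference between the two indexing variants is the `metadata_fields_to_embed` argument, which keeps the comparison fair: same cleaning, same chunking, same model.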

Specifically, I want to compare the standard approach (where we embed only the text) with the metadata embedding strategy (where we embed the text together with meaningful metadata).
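To make the difference between the two approaches concrete, here is a small pure-Python illustration of the string that ends up being embedded in each case. It mirrors my understanding of how Haystack's document embedders prepend the selected meta values to the content; the function name, the separator default and the sample chunk are made up for this sketch.

```python
def text_to_embed(content, meta, meta_fields_to_embed=None, separator="\n"):
    """Return the string that would be passed to the embedding model."""
    fields = meta_fields_to_embed or []
    # Skip fields that are missing or empty for this Document.
    meta_values = [str(meta[f]) for f in fields if meta.get(f) is not None]
    return separator.join(meta_values + [content])


# Made-up chunk: on its own, it does not say which band it is about.
chunk = "Their second album was released the following year."
meta = {"title": "The Cure", "url": "https://en.wikipedia.org/wiki/The_Cure"}

standard = text_to_embed(chunk, meta)
with_metadata = text_to_embed(chunk, meta, meta_fields_to_embed=["title"])

print(repr(standard))
print(repr(with_metadata))
```

After splitting, many chunks no longer mention the band by name. With the title prepended, the vector of each chunk can carry that information, which is what a query naming the band needs.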

[ ]

Create the Pipelines

Let's define 2 Document Stores, to compare the different approaches.

[ ]

Now, I create the 2 indexing pipelines and run them.
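Running an indexing pipeline could be wrapped in a small helper like the one below. This is a sketch under assumptions: it expects the first component to be called `cleaner` and the last one to be a `DocumentWriter` called `writer`, names that are not confirmed by this notebook.

```python
def index_documents(indexing_pipeline, documents, entry_component="cleaner"):
    """Run an indexing pipeline and return how many Documents were written."""
    result = indexing_pipeline.run({entry_component: {"documents": documents}})
    # DocumentWriter reports the number of stored Documents under this key.
    return result["writer"]["documents_written"]
```

The same list of Documents is passed to both pipelines, for example `index_documents(indexing_pipeline, docs)` and `index_documents(indexing_pipeline_with_metadata, docs)` (hypothetical variable names), each writing to its own Document Store.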

[ ]

Create the 2 retrieval pipelines.

[ ]

🧪 Run the experiment!
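A sketch of how the same query can be sent to both retrieval pipelines and the results shown side by side. It assumes the retrieval pipelines expose components named `text_embedder` and `retriever`; the helper names are mine.

```python
def retrieve(pipeline, query):
    """Run a retrieval pipeline and return the retrieved Documents."""
    result = pipeline.run({"text_embedder": {"text": query}})
    return result["retriever"]["documents"]


def compare(query, pipelines):
    """Print the top results of each named pipeline for the same query."""
    for name, pipeline in pipelines.items():
        print(f"--- {name} ---")
        for doc in retrieve(pipeline, query):
            title = doc.meta.get("title", "?")
            # Show the similarity score, the source page and a short preview.
            print(f"{doc.score:.3f} | {title} | {doc.content[:80]!r}")
```

A call could look like `compare(query, {"text only": retrieval_pipeline, "text + metadata": retrieval_pipeline_with_metadata})`, where the variable names are hypothetical and `query` is a question about one of the bands.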

[ ]

❌ the retrieved Documents seem irrelevant

[ ]

✅ the first Document is relevant

[ ]

❌ the retrieved Documents seem irrelevant

[ ]

✅ some Documents are relevant

⚠️ Notes of caution

This technique is not a silver bullet.
It works well when the embedded metadata are meaningful and distinctive.
I would say that the embedded metadata should be meaningful from the perspective of the embedding model. For example, I don't expect embedding numbers to work well.