Extracting Metadata Filters From A User Query


Extract Metadata Filters from a Query

Notebook by David Batista

This is part one of the Advanced Use Cases series:

1️⃣ Extract Metadata from Queries to Improve Retrieval & the full article

2️⃣ Query Expansion cookbook & full article

3️⃣ Query Decomposition cookbook & the full article

4️⃣ Automated Metadata Enrichment

In this notebook, we'll discuss how to implement a custom component, QueryMetadataExtractor, that extracts entities from the query and formulates the corresponding metadata filter.

Useful Sources

Set Up the Development Environment


Enter your OPENAI_API_KEY when prompted. If you do not have one yet, you can create an API key in your OpenAI account.
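A minimal sketch of one way to supply the key, assuming it is read from the OPENAI_API_KEY environment variable. The helper name ensure_api_key and its parameters are illustrative and not part of the original notebook; it prompts only when the variable is not already set.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in env_var, prompting for it only if it is missing.

    ensure_api_key is a hypothetical helper, not a Haystack or OpenAI function.
    """
    if not os.environ.get(env_var):
        # getpass hides the typed value, so the key is not echoed in the cell output
        os.environ[env_var] = prompt(f"Enter {env_var}: ")
    return os.environ[env_var]
```

Calling ensure_api_key() once near the top of the notebook should be enough; later cells can then rely on the environment variable.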


Implement QueryMetadataExtractor

Create a custom component, QueryMetadataExtractor, that takes query and metadata_fields as inputs and outputs filters. The component encapsulates a generative pipeline made up of a PromptBuilder and an OpenAIGenerator. The pipeline instructs the LLM to extract keywords, phrases, or entities from the query, which can then be used as metadata filters. The prompt asks for JSON output and passes metadata_fields along with the query, so that only entities matching those fields are extracted.
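To illustrate the prompt-building step, here is a plain-Python sketch that fills a template with the query and the metadata fields. It uses str.format as a stand-in for PromptBuilder, and the prompt wording, PROMPT_TEMPLATE, and render_prompt are assumptions made for this example rather than the notebook's actual prompt.

```python
import json

# Hypothetical prompt wording; the notebook's actual prompt is not reproduced here.
PROMPT_TEMPLATE = """\
You extract information from a user query that matches a list of metadata fields.
Return only a valid JSON object that maps each matching field to the extracted value.
Leave out fields that the query does not mention.

Query: {query}
Metadata fields: {metadata_fields}
Extracted metadata:"""


def render_prompt(query, metadata_fields):
    """Fill the template with the query and the JSON-encoded list of fields."""
    return PROMPT_TEMPLATE.format(
        query=json.dumps(query),
        metadata_fields=json.dumps(list(metadata_fields)),
    )
```

Passing the field names explicitly gives the model a closed set of keys to fill in, which makes its reply easier to parse in the next step.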

The pipeline is initialized in the init method of the component, and the LLM output is post-processed in the run method. This post-processing step ensures that the extracted metadata is correctly formatted to be used as a metadata filter.
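The post-processing step can be sketched in plain Python as follows. build_filters is a hypothetical helper, not the notebook's code; it assumes the LLM reply is a JSON object mapping field names to values, and it produces the same filter shape as the example output shown in this notebook.

```python
import json


def build_filters(llm_reply, operator="AND"):
    """Turn the LLM's JSON reply into a filter dictionary.

    Hypothetical helper: one "==" condition per extracted field,
    combined under a single logical operator.
    """
    try:
        extracted = json.loads(llm_reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"LLM reply is not valid JSON: {llm_reply!r}") from err
    if not isinstance(extracted, dict):
        raise ValueError("expected a JSON object mapping field names to values")

    # Metadata lives under the "meta." prefix in the filter's field path.
    conditions = [
        {"field": f"meta.{name}", "operator": "==", "value": value}
        for name, value in extracted.items()
    ]
    return {"filters": {"operator": operator, "conditions": conditions}}
```

Raising on malformed replies is one possible design choice; a more lenient variant could fall back to running the retriever without filters.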


First, let's test the QueryMetadataExtractor in isolation, passing a query and a list of metadata fields.

{'filters': {'operator': 'AND', 'conditions': [{'field': 'meta.year', 'operator': '==', 'value': 2022}, {'field': 'meta.disease', 'operator': '==', 'value': 'Parkinson'}]}}

Notice that the QueryMetadataExtractor has extracted the metadata fields from the query and returned them in a format that can be used as filters passed directly to a Retriever. By default, the QueryMetadataExtractor will use all metadata fields as conditions together with an AND operator.

Use QueryMetadataExtractor in a Pipeline

Now, let's plug the QueryMetadataExtractor into a Pipeline with a Retriever connected to a DocumentStore to see how it works in practice.

We start by creating an InMemoryDocumentStore and adding some documents to it. We include info about “year” and “disease” in the “meta” field of each document.
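To make the effect of such a filter concrete, here is a plain-Python sketch, independent of Haystack, that applies an AND of == conditions to documents represented as dictionaries. The sample documents are made up for illustration (only the Alzheimer entry mirrors the document in the output at the end of this notebook), and matches is a hypothetical helper; a real document store supports more operators and nested conditions.

```python
def matches(doc, filters):
    """Return True if doc satisfies every "==" condition of a top-level AND filter."""
    if filters["operator"] != "AND":
        raise ValueError("this sketch only handles a top-level AND")
    for cond in filters["conditions"]:
        if cond["operator"] != "==":
            raise ValueError("this sketch only handles '==' conditions")
        field = cond["field"]
        # "meta.year" -> look up "year" inside the document's meta dictionary
        key = field[len("meta."):] if field.startswith("meta.") else field
        if doc["meta"].get(key) != cond["value"]:
            return False
    return True


# Illustrative documents, each carrying "year" and "disease" in its meta field
documents = [
    {"content": "some text about investigation and treatment of Alzheimer disease",
     "meta": {"year": 2023, "disease": "Alzheimer", "author": "John Bread"}},
    {"content": "an illustrative note on Parkinson therapies",
     "meta": {"year": 2022, "disease": "Parkinson", "author": "Jane Doe"}},
]

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.year", "operator": "==", "value": 2023},
        {"field": "meta.disease", "operator": "==", "value": "Alzheimer"},
    ],
}

# Keep only the documents whose metadata satisfies every condition
selected = [doc for doc in documents if matches(doc, filters)]
```

In this sketch a document that lacks one of the filtered fields is simply excluded, since its missing value cannot equal the requested one.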

4

We then create a pipeline consisting of the QueryMetadataExtractor and an InMemoryBM25Retriever connected to the InMemoryDocumentStore created above.

Learn about connecting components and creating pipelines in Docs: Creating Pipelines.

<haystack.core.pipeline.pipeline.Pipeline object at 0x789b1bba1900>
🚅 Components
  - metadata_extractor: LLMMetadataQueryExtractor
  - retriever: InMemoryBM25Retriever
🛤️ Connections
  - metadata_extractor.filters -> retriever.filters (Dict[str, str])

Now define a query and metadata fields and pass them to the pipeline:

{'retriever': {'documents': [Document(id=e3b0bfd497a9f83397945583e77b293429eb5bdead5680cc8f58dd4337372aa3, content: 'some text about investigation and treatment of Alzheimer disease', meta: {'year': 2023, 'disease': 'Alzheimer', 'author': 'John Bread'}, score: 2.772588722239781)]}}