Extracting Metadata Filters From A User Query
Notebook by David Batista
This is part one of the Advanced Use Cases series:
1️⃣ Extract Metadata from Queries to Improve Retrieval & the full article
2️⃣ Query Expansion cookbook & full article
3️⃣ Query Decomposition cookbook & the full article
In this notebook, we'll discuss how to implement a custom component, QueryMetadataExtractor, that extracts entities from the query and formulates the corresponding metadata filter.
Useful Sources
Set Up the Development Environment
Enter your OPENAI_API_KEY. Get your OpenAI API key here:
Enter OpenAI API key:··········
Implement QueryMetadataExtractor
Create a custom component, QueryMetadataExtractor, which takes query and metadata_fields as inputs and outputs filters. This component encapsulates a generative pipeline made up of a PromptBuilder and an OpenAIGenerator. The pipeline instructs the LLM to extract keywords, phrases, or entities from a given query, which can then be used as metadata filters. The prompt asks the model to return JSON and supplies metadata_fields along with the query, so that only entities matching those fields are extracted.
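As a rough illustration of the kind of prompt described above, here is a minimal sketch that fills a template with the query and the metadata fields. It uses Python's string.Template rather than PromptBuilder so that it runs without any dependencies. The prompt wording, the build_prompt helper, and the example query are all assumptions made for illustration, not the notebook's actual prompt.

```python
from string import Template

# Hypothetical prompt wording, written for this sketch only.
PROMPT_TEMPLATE = Template(
    "Extract the values of the following metadata fields from the query.\n"
    "Metadata fields: $fields\n"
    "Query: $query\n"
    "Respond with a single JSON object whose keys are field names. "
    "Omit any field that the query does not mention.\n"
    "JSON:"
)


def build_prompt(query: str, metadata_fields: list[str]) -> str:
    """Fill the template with the query and a comma-separated field list."""
    return PROMPT_TEMPLATE.substitute(
        fields=", ".join(metadata_fields),
        query=query,
    )


# Illustrative query; any query mentioning a year and a disease would do.
prompt = build_prompt(
    "What were the most influential publications in 2022 regarding Parkinson's disease?",
    ["year", "disease"],
)
print(prompt)
```

Passing the field names explicitly keeps the model focused on values that can actually be matched against document metadata.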
Once the pipeline is initialized in the component's __init__ method, the run method post-processes the LLM output. This step ensures the extracted metadata is correctly formatted to be used as a metadata filter.
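The post-processing step might look roughly like the sketch below, which parses the model's JSON reply and wraps each extracted value in an equality condition. The build_filters name, the shape of the reply (a flat JSON object keyed by field name), and the choice to return an empty dict when nothing usable is extracted are assumptions for illustration.

```python
import json


def build_filters(llm_reply: str, metadata_fields: list[str]) -> dict:
    """Turn a JSON reply such as '{"year": 2022}' into an AND filter.

    Returns an empty dict when the reply is not a JSON object or contains
    none of the requested fields (an assumption about how "no filter"
    should be represented).
    """
    try:
        extracted = json.loads(llm_reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(extracted, dict):
        return {}

    # Keep only the requested fields, in the order they were requested.
    conditions = [
        {"field": f"meta.{name}", "operator": "==", "value": extracted[name]}
        for name in metadata_fields
        if extracted.get(name) is not None
    ]
    if not conditions:
        return {}
    return {"operator": "AND", "conditions": conditions}


# Hypothetical LLM reply for a query about Parkinson's disease in 2022.
reply = '{"year": 2022, "disease": "Parkinson"}'
filters = build_filters(reply, ["year", "disease"])
print(filters)
```

Guarding against malformed JSON matters here because the reply comes from an LLM and is not guaranteed to follow the requested format.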
First, let's test the QueryMetadataExtractor in isolation, passing a query and a list of metadata fields.
{'filters': {'operator': 'AND', 'conditions': [{'field': 'meta.year', 'operator': '==', 'value': 2022}, {'field': 'meta.disease', 'operator': '==', 'value': 'Parkinson'}]}}
Notice that the QueryMetadataExtractor has extracted the metadata values from the query and returned them in a format that can be passed directly to a Retriever as filters. By default, the QueryMetadataExtractor combines all extracted metadata fields as conditions under a single AND operator.
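To make the AND semantics concrete, the following sketch checks a document's metadata against a filter of the shape shown above. It is a simplified stand-in and not the Retriever's actual filtering logic. The matches helper is hypothetical and handles only a top-level AND with == conditions, and the Parkinson metadata record is invented for the example.

```python
def matches(meta: dict, filters: dict) -> bool:
    """Return True if `meta` satisfies every '==' condition of an AND filter."""
    if filters.get("operator") != "AND":
        raise ValueError("this sketch only handles a top-level AND")
    for cond in filters["conditions"]:
        if cond["operator"] != "==":
            raise ValueError("this sketch only handles '==' conditions")
        # Filter fields are written as 'meta.<key>'; strip the prefix.
        key = cond["field"].removeprefix("meta.")
        if meta.get(key) != cond["value"]:
            return False
    return True


filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.year", "operator": "==", "value": 2022},
        {"field": "meta.disease", "operator": "==", "value": "Parkinson"},
    ],
}

# Hypothetical metadata records used only for this illustration.
parkinson_2022 = {"year": 2022, "disease": "Parkinson"}
alzheimer_2023 = {"year": 2023, "disease": "Alzheimer", "author": "John Bread"}

print(matches(parkinson_2022, filters))
print(matches(alzheimer_2023, filters))
```

Because the conditions are joined with AND, a document is kept only when every extracted field matches; a single mismatch is enough to exclude it.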
Use QueryMetadataExtractor in a Pipeline
Now, let's plug the QueryMetadataExtractor into a Pipeline with a Retriever connected to a DocumentStore to see how it works in practice.
We start by creating an InMemoryDocumentStore and adding some documents to it. We include info about “year” and “disease” in the “meta” field of each document.
4
We then create a pipeline consisting of the QueryMetadataExtractor and an InMemoryBM25Retriever connected to the InMemoryDocumentStore created above.
Learn about connecting components and creating pipelines in Docs: Creating Pipelines.
<haystack.core.pipeline.pipeline.Pipeline object at 0x789b1bba1900>
🚅 Components
  - metadata_extractor: LLMMetadataQueryExtractor
  - retriever: InMemoryBM25Retriever
🛤️ Connections
  - metadata_extractor.filters -> retriever.filters (Dict[str, str])
Now define a query and metadata fields and pass them to the pipeline:
{'retriever': {'documents': [Document(id=e3b0bfd497a9f83397945583e77b293429eb5bdead5680cc8f58dd4337372aa3, content: 'some text about investigation and treatment of Alzheimer disease', meta: {'year': 2023, 'disease': 'Alzheimer', 'author': 'John Bread'}, score: 2.772588722239781)]}}