Notebooks
d
deepset
Metadata Enrichment

Metadata Enrichment

agentic-aiagenticagentsgenaiAIhaystack-cookbookgenai-usecaseshaystack-ainotebooksPythonragai-tools

Advanced RAG: Automated Structured Metadata Enrichment

by Tuana Celik (LI, Twitter)

This is part one of the Advanced Use Cases series:

1️⃣ Extract Metadata from Queries to Improve Retrieval cookbook & full article

2️⃣ Query Expansion cookbook & full article

3️⃣ Query Decomposition cookbook & the full article

4️⃣ Automated Metadata Enrichment

In this example, you'll see how you can make use of structured outputs which is an option for some LLMs, and a custom Haystack component, to automate the enrichment of metadata from documents.

You will see how you can define your own metadata fields as a Pydantic Model, as well as the data types each field should have. Finally, you will get a custom MetadataEnricher to extract the required fields and add them to the document meta information.

In this example, we will be enriching metadata with information relating the funding announements.

Once we populate the metadata of a document with our own fields, we are able to use Metadata Filtering during the retrieval step of RAG pipelines. We can even combine this with Metadata Extraction from Queries to Improve Retrieval to be very precise about what documents we are providing as context to an LLM.

📺 Code Along

Install requirements

[ ]
[70]

🧪 Experimental Addition to the OpenAIGenerator for Structured Output Support

🚀 This is the same extension to the OpenAIGenerator that was used in the Advanced RAG: Query Decomposition and Reasoning example

Let's extend the OpenAIGeneraotor to be able to make use of the strctured output option by OpenAI. Below, we extend the class to call self.client.beta.chat.completions.parse if the user has provides a respose_format in generation_kwargs. This will allow us to provifde a Pydantic Model to the gnerator and request our generator to respond with structured outputs that adhere to this Pydantic schema.

[71]
[3]
OpenAI API Key:··········

Custom MetadataEnricher

We create a custom Haystack component that is able ti accept metadata_model and prompt. If no prompt is provided, it usees the DEFAULT_PROMPT.

This component returns documents enriched with the requested metadata fileds.

[72]

Define Metadata Fields as a Pydantic Model

For automatic metadata enrichment, we want to be able to provide a structure describing what fields we want to extract, as well as what types they should be.

Below, I have defined a Metadata model, with 4 fields.

💡 Note: In some cases, it might make sense to make each field optional, or provide default values.

[73]

Next, we initialize a MetadataEnricher and provide Metadata as the metadata_model we want to abide by.

[79]

Build an Automated Metadata Enrichment Pipeline

Now that we have our enricher, we can use it in a pipeline. Below is an example of a pipeline that fetches the contents of some URLs (in this case, urls that contain information about funding announcements). The pipeline then adds the requested metadata fields to each Document's meta field 👇

[80]
{'enricher': {'documents': [Document(id=5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD'}),
,   Document(id=8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
,   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD'})]}}
[81]
Output

Extra: Metadata Inheritance

This is just an extra step to show how metadata that belongs to a document is inherited by the document chunks if you use a component such as the DocumentSplitter.

[82]
<haystack.core.pipeline.pipeline.Pipeline object at 0x77ff80aee8c0>
,🚅 Components
,  - fetcher: LinkContentFetcher
,  - converter: HTMLToDocument
,  - enricher: MetadataEnricher
,  - splitter: DocumentSplitter
,🛤️ Connections
,  - fetcher.streams -> converter.sources (List[ByteStream])
,  - converter.documents -> enricher.documents (List[Document])
,  - enricher.documents -> splitter.documents (List[Document])
[83]
{'splitter': {'documents': [Document(id=9611aa2bdb658163d8f6964220052065936fcd036dd24743d1b34ce79d25bc5a, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
,   Document(id=6bffbcf9f1cd1a3940628d1450c9ba9a8c9a092136896d295b35af3175caffbf, content: 'unfortunate state of affairs is likely contributing to challenges around AI development within the e...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1256}),
,   Document(id=da372f9bc2292f487f0aad372053a531744d9549a46f8ac197c216abaa4d99d0, content: 'to end users, and perform analyses of the LLMs’ accuracy while continuously monitoring their perform...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2609}),
,   Document(id=f316cd275e8bc763de41d128dbdbd81e1baad2693b0102f6951c4f46aa8f6048, content: 'predicts that the sector for MLOps will reach $23.1 billion by 2031, up from around $1 billion in 20...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3997}),
,   Document(id=c7ff4e0d7af8aaa16f3195cb1f9096bb1cf8e7d985190fa6746c278b1d8457e8, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
,   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
,   Document(id=646f4d43ee20fcbe35c97bd04e8fd6edd4ad5c9af63fe4a63267f5df1807254f, content: 'by humans.
,   Launched in 2020, Arize's ML observability platform is already counted on by a growing li...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1360}),
,   Document(id=09382ae0ab9adbd7860199b0d86e8ca044787eda860dfb181e3b243c6584a427, content: 'what happened, and improve overall model performance," says Morgan Gerlak, Partner at TCV. "Like oth...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2697}),
,   Document(id=94497826791e38f15016a2360d7e3eea5e242770f51dd11be69fb21210e89c9a, content: 'you are going to be left behind," notes Brett Wilson, Co-Founder and General Partner at Swift Ventur...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3967})]}}