deepset: Metadata Extraction with LLMMetadataExtractor

LLMMetadataExtractor: seamless metadata extraction from documents with just a prompt

Notebook by David S. Batista

This notebook shows how to use LLMMetadataExtractor. We will use a Large Language Model to extract metadata from a Document.

Setting Up

[ ]
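The setup cell is empty in this export. As a minimal sketch, assuming the notebook depends on the haystack-ai and sentence-transformers packages (the indexing pipeline further down uses SentenceTransformersDocumentEmbedder), the following check reports which of them are still missing. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names of the packages this notebook is assumed to need:
# `haystack` is provided by the haystack-ai distribution,
# `sentence_transformers` by the sentence-transformers distribution.
REQUIRED_MODULES = ["haystack", "sentence_transformers"]


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Please install the packages providing:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can typically be installed with `pip install haystack-ai sentence-transformers`.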

Initialize LLMMetadataExtractor

Let's define what kind of metadata we want to extract from our documents. We will do it through an LLM prompt, which will then be used by the LLMMetadataExtractor component. In this case we want to extract named entities from our documents.

[2]
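The prompt cell is not shown in this export. A minimal sketch of such a prompt could look like the one below. The wording is an assumption, and the entity types are taken from the outputs shown later in this notebook. `{{ document.content }}` is the template placeholder through which the component is expected to inject the text of each document.

```python
# Hypothetical prompt; only the {{ document.content }} placeholder and the
# top-level "entities" key are tied to what the rest of the notebook expects.
NER_PROMPT = """
-Goal-
Given a text and a list of entity types, identify all entities of those types in the text.

-Steps-
1. Identify all entities. For each entity, extract:
   - entity: name of the entity
   - entity_type: one of the types listed under "entity_types"
2. Return a single JSON object of the form:
   {"entities": [{"entity": <entity_name>, "entity_type": <entity_type>}]}

-Real Data-
entity_types: [company, person, city, country, product]
text: {{ document.content }}
output:
"""
```

The prompt names the output format explicitly (a JSON object with an "entities" key), so that the reply can be parsed and checked against the expected keys.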

Let's initialise an instance of LLMMetadataExtractor, using OpenAI as the LLM provider and the prompt defined above to perform metadata extraction.

[3]

We will also need to set the OPENAI_API_KEY environment variable.

[4]
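The cell that sets the key is not shown in this export. One common way to do it, sketched here under the assumption that the key should be read interactively when it is not already set, is the following. `ensure_api_key` is a hypothetical helper; its `env` and `ask` parameters exist only to make it easy to test.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, ask=getpass):
    """Set OPENAI_API_KEY in `env` if it is missing, prompting for it without echo."""
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = ask("Enter OpenAI API key: ")
    return env["OPENAI_API_KEY"]


# In the notebook, call ensure_api_key() once before creating the extractor.
```

Using getpass keeps the key out of the notebook output, which is why only dots appear after the prompt.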

Enter OpenAI API key: ········

We will now instantiate an LLMMetadataExtractor with OpenAI as the LLM provider. Notice that the prompt parameter is set to the prompt we defined above, and that we also need to specify which keys should be present in the JSON output, in this case "entities".

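Another important parameter is raise_on_failure=False: if the LLM fails for some document (e.g., because of a network error, or because it doesn't return a valid JSON object), the component does not raise an exception and continues processing the remaining documents in the input. Documents that could not be processed are returned under failed_documents.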
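To make the role of the expected keys and of raise_on_failure more concrete, here is a simplified, library-independent sketch of the kind of check they describe. It is an illustration only, not Haystack's implementation: `parse_reply` is a hypothetical helper, and treating a missing key as a failure is an assumption of this sketch.

```python
import json


def parse_reply(reply, expected_keys, raise_on_failure=False):
    """Parse an LLM reply as JSON and check that the expected keys are present.

    Returns (metadata, error). On failure it either re-raises the error or
    returns (None, message), so the caller can record the document as failed
    and move on to the next one.
    """
    try:
        metadata = json.loads(reply)
        if not isinstance(metadata, dict):
            raise ValueError("reply is not a JSON object")
        missing = [key for key in expected_keys if key not in metadata]
        if missing:
            raise ValueError(f"missing expected keys: {missing}")
        return metadata, None
    except ValueError as error:  # json.JSONDecodeError is a subclass of ValueError
        if raise_on_failure:
            raise
        return None, str(error)


good, _ = parse_reply('{"entities": [{"entity": "Berlin", "entity_type": "city"}]}', ["entities"])
bad, error = parse_reply("not json", ["entities"])
print(good["entities"][0]["entity"])  # Berlin
print(bad, "-", error)
```

With raise_on_failure=False, a single malformed reply costs one document rather than the whole batch.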

[17]
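The instantiation cell is missing from this export. The sketch below shows how it might look, assuming a recent Haystack 2.x release in which LLMMetadataExtractor accepts a `chat_generator`; older releases selected the LLM provider through different parameters, so please check the signature of your installed version. The generation settings and the `build_extractor` wrapper are assumptions.

```python
def build_extractor(prompt):
    """Create an LLMMetadataExtractor backed by an OpenAI chat model.

    The imports are local so that this function can be defined even where
    Haystack is not installed; calling it needs haystack-ai and OPENAI_API_KEY.
    """
    from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
    from haystack.components.generators.chat import OpenAIChatGenerator

    # Assumption: ask for deterministic, JSON-formatted replies.
    chat_generator = OpenAIChatGenerator(
        generation_kwargs={
            "temperature": 0.0,
            "response_format": {"type": "json_object"},
        },
    )
    return LLMMetadataExtractor(
        prompt=prompt,                # the NER prompt defined earlier
        chat_generator=chat_generator,
        expected_keys=["entities"],   # keys that should appear in the JSON output
        raise_on_failure=False,       # keep going when a document fails
    )
```

Requesting a JSON object from the model is optional, but it makes it less likely that a reply fails to parse.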

Let's define the documents from which the component will extract metadata, i.e., named entities.

[6]

[13]
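The cells defining the documents are not shown. The texts below are copied from the outputs further down in this notebook; wrapping them in Document objects via the hypothetical `build_documents` helper is a sketch of how the input was probably prepared.

```python
# Texts taken from the outputs shown in this notebook.
TEXTS = [
    "deepset was founded in 2018 in Berlin, and is known for its Haystack framework",
    "Hugging Face is a company that was founded in New York, USA and is known for its Transformers library",
    "Google was founded in 1998 by Larry Page and Sergey Brin",
    "Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot",
    "Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, "
    "founded in 1847 by Werner von Siemens",
]


def build_documents(texts):
    """Wrap plain strings in Haystack Document objects (requires haystack-ai)."""
    from haystack import Document

    return [Document(content=text) for text in texts]
```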

and let's extract :)

[14]

[15]
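The extraction cells are not shown. A minimal sketch of the call, assuming the component follows the usual Haystack convention of a `run` method that takes the documents as a keyword argument, could be the following. `extract_metadata` is a hypothetical wrapper.

```python
def extract_metadata(extractor, documents):
    """Run the extractor on the documents and return its result dictionary."""
    # Some component versions expose warm_up() to initialise the underlying
    # generator; call it only if it exists.
    if hasattr(extractor, "warm_up"):
        extractor.warm_up()
    return extractor.run(documents=documents)
```

As the output shows, the result contains the enriched documents under "documents" and any documents that could not be processed under "failed_documents".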

{'documents': [Document(id=05fe6674dd4faf3dcaa991f9e6d520c9185d5644c4ac2b8b52276e6b70a831f2, content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework', meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}),
  Document(id=37364c858185cf02abc43b43db613d236baa4dd501453d6942681842863c313a, content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers librar...', meta: {'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}),
  Document(id=eb4e2410115dfb7edc47b84853d0cdc845699120509346383896ed7d47354e2d, content: 'Google was founded in 1998 by Larry Page and Sergey Brin', meta: {'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}),
  Document(id=ee28eff307d3a1d435f0515195e0a86e592b72b5570dcaddc4d3108769632596, content: 'Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot', meta: {'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}),
  Document(id=0a56bf794d37839113a73634cc0f3ecab33744eeea7b682b49fd2dc51737aed8, content: 'Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded i...', meta: {'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]})],
 'failed_documents': []}

Indexing Pipeline with Extraction

Let's now build an indexing pipeline to which we simply give the Documents as input, and which produces a Document Store containing the documents indexed together with their extracted metadata.

[18]
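The cell that builds the pipeline is missing. Based on the component names and connections printed below, it might look like the following sketch. The `build_indexing_pipeline` wrapper, the default embedding model, and the in-memory document store mentioned in the comment are assumptions.

```python
def build_indexing_pipeline(extractor, document_store):
    """Wire extractor -> embedder -> writer, as listed in the pipeline output.

    The imports are local so the function can be defined without Haystack
    installed; calling it requires haystack-ai and sentence-transformers.
    """
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersDocumentEmbedder
    from haystack.components.writers import DocumentWriter

    pipeline = Pipeline()
    pipeline.add_component("metadata_extractor", extractor)
    pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder())
    pipeline.add_component("writer", DocumentWriter(document_store=document_store))

    # Documents flow from the extractor to the embedder and then to the writer.
    pipeline.connect("metadata_extractor.documents", "embedder.documents")
    pipeline.connect("embedder.documents", "writer.documents")
    return pipeline


# Possible usage, assuming an in-memory store:
#   from haystack.document_stores.in_memory import InMemoryDocumentStore
#   doc_store = InMemoryDocumentStore()
#   pipeline = build_indexing_pipeline(extractor, doc_store)
```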

<haystack.core.pipeline.pipeline.Pipeline object at 0x320d71010>
🚅 Components
  - metadata_extractor: LLMMetadataExtractor
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - metadata_extractor.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])

Try it Out!

[19]
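The run cell is not shown. Since the first component is named metadata_extractor and takes documents as input, the call was presumably along these lines. `run_indexing` is a hypothetical wrapper.

```python
def run_indexing(pipeline, documents):
    """Feed the documents to the first component of the indexing pipeline."""
    # Pipeline inputs are keyed by component name, then by input name.
    return pipeline.run(data={"metadata_extractor": {"documents": documents}})
```

The returned dictionary only contains outputs that were not consumed by another component, which is why the result reports failed_documents for the extractor and documents_written for the writer.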

{'metadata_extractor': {'failed_documents': []},
 'writer': {'documents_written': 5}}

Let's inspect the documents' metadata in the document store.

[20]
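The inspection cell is missing. A sketch that would produce output in the format shown below, assuming the document store offers the usual `filter_documents()` method, is the following. Both helper names are hypothetical.

```python
def format_document(doc):
    """Return the content and metadata of a document, followed by a separator."""
    return f"{doc.content}\n{doc.meta}\n\n---------"


def print_documents(document_store):
    """Print every document in the store together with its extracted metadata."""
    # Calling filter_documents() without filters returns all stored documents.
    for doc in document_store.filter_documents():
        print(format_document(doc))
```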

deepset was founded in 2018 in Berlin, and is known for its Haystack framework
{'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}

---------
Hugging Face is a company that was founded in New York, USA and is known for its Transformers library
{'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}

---------
Google was founded in 1998 by Larry Page and Sergey Brin
{'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}

---------
Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot
{'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}

---------
Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens
{'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]}

---------