Azure Entity Extraction For Long Documents


Long Document Content Extraction

GPT-3 can help us extract key figures, dates, or other important content from documents that are too big to fit into the context window. One way to handle this is to split the document into chunks, process each chunk separately, and then combine the results into one list of answers. In this notebook we'll run through this approach:

- Load in a long PDF and pull the text out
- Create a prompt to be used to extract key bits of information
- Chunk up our document and process each chunk to pull any answers out
- Combine them at the end
- This simple approach will then be extended to three more difficult questions
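
The chunk, process, and combine steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's own code: it assumes tokens can be approximated by whitespace-separated words, and `extractFromChunk` is a hypothetical stand-in for the model call. All names here are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split text into chunks of at most maxWords words.
// Assumption: words are used as a rough proxy for tokens.
List<string> ChunkByWords(string text, int maxWords)
{
    var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    for (int start = 0; start < words.Length; start += maxWords)
    {
        chunks.Add(string.Join(" ", words.Skip(start).Take(maxWords)));
    }
    return chunks;
}

// Process every chunk independently, then flatten the results into one list.
// extractFromChunk is a hypothetical placeholder for the call to the model.
List<string> ExtractFromDocument(
    string document,
    int maxWords,
    Func<string, IEnumerable<string>> extractFromChunk)
{
    return ChunkByWords(document, maxWords)
        .SelectMany(chunk => extractFromChunk(chunk))
        .ToList();
}

// Demo with a stub extractor that simply echoes the chunk it was given.
var sampleText = "one two three four five six seven";
var answers = ExtractFromDocument(sampleText, 3, chunk => new[] { $"seen: {chunk}" });
foreach (var answer in answers) Console.WriteLine(answer);
```

Passing the extractor in as a delegate keeps the chunking logic testable without a live endpoint; in a real run the stub would be replaced by the function that sends the prompt to the chat deployment.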

Approach

- Setup: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract its text for entity extraction. We'll use this to try to extract answers that are buried in the content.
- Simple Entity Extraction: Extract key bits of information from chunks of a document by:
    - Creating a template prompt with our questions and an example of the expected answer format
    - Creating a function that takes a chunk of text as input, combines it with the prompt, and gets a response
    - Running a script to chunk the text, extract answers, and output them for parsing
- Complex Entity Extraction: Ask some more difficult questions that require tougher reasoning to work out
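
As a rough sketch of the template prompt and the per-chunk function described above: the template text, the `<document>` placeholder, and the `complete` delegate are all assumptions made for illustration, not the notebook's actual prompt or API call. Only the "Not specified" convention is taken from the notebook itself.

```csharp
using System;

// Hypothetical template: the questions and the sample answer are placeholders.
// "<document>" marks where each chunk of text is substituted in.
var promptTemplate = @"Extract key pieces of information from this document.
If a particular piece of information is not present, output ""Not specified"".
Use the following format:
0. <first question>
1. <second question>

Document: <document>

0. <first question>: <sample answer>
1.";

// Combine one chunk with the template and pass the prompt to a completion function.
// `complete` is a hypothetical stand-in for the call to the chat deployment.
string ExtractChunk(string chunk, string template, Func<string, string> complete)
{
    var filledPrompt = template.Replace("<document>", chunk);
    return complete(filledPrompt);
}

// Stub that records the prompt it receives and returns a canned answer.
string lastPrompt = "";
var response = ExtractChunk(
    "Sample text from one chunk.",
    promptTemplate,
    p => { lastPrompt = p; return "1. <second question>: Not specified"; });

Console.WriteLine(response);
```

Ending the template with a partially completed answer list is one way to nudge the model into continuing in the same format, which makes the output easier to parse line by line.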

Run this cell. It will prompt you for the `apiKey`, `endPoint`, and `chatDeployment` values.

Install the `itext` package

Parse PDFs using iText Core from NuGet

Simple Entity Extraction

The `extractions` object is a collection of strings, each holding one piece of information extracted from the document.

`Where` is a LINQ (Language Integrated Query) method that filters the collection. It takes the lambda expression `p => !p.Contains("Not specified")`, which returns true when an item `p` does not contain the string "Not specified" and false otherwise, so every item containing "Not specified" is dropped.

`DisplayTable` is then called on the filtered collection and shows the remaining items as a table, one row per item.
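
To make the filtering step concrete, here is a small self-contained sketch. The sample strings are made up for illustration, and console output stands in for the notebook's `DisplayTable` call, which is only available inside the notebook environment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample answers; a real run would hold model output here.
var extractions = new List<string>
{
    "1. Example question A: Not specified",
    "1. Example question A: 42 (Page 3)",
    "2. Example question B: Not specified",
};

// Keep only the items that do not contain the "Not specified" marker.
var found = extractions.Where(p => !p.Contains("Not specified")).ToList();

// Console output stands in for the notebook's DisplayTable call.
foreach (var item in found) Console.WriteLine(item);
```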

Complex Entity Extraction

Consolidation

We extracted the first two answers reliably. The third was confounded by the date that appears on every page, although the correct answer is also among the results.

To tune this further you can consider experimenting with:

- A more descriptive or specific prompt
- Fine-tuning a model on the target outputs, if you have sufficient training data
- The way you chunk your data: we have gone for 1000 tokens with no overlap, but more intelligent chunking that breaks the text into sections, cuts by tokens, or similar may get better results
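
One simple variation on the chunking idea is to let consecutive chunks overlap, so that an answer straddling a chunk boundary still appears whole in at least one chunk. The sketch below is an assumption-laden illustration, not the notebook's chunker: it counts whitespace-separated words instead of real tokens, and `ChunkWithOverlap` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sliding-window chunker: each chunk holds up to chunkSize words and shares
// `overlap` words with the previous chunk.
// Assumption: words approximate tokens; a real tokenizer would count differently.
List<string> ChunkWithOverlap(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

    var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    int step = chunkSize - overlap;
    for (int start = 0; start < words.Length; start += step)
    {
        chunks.Add(string.Join(" ", words.Skip(start).Take(chunkSize)));
        if (start + chunkSize >= words.Length) break; // this window already reached the end
    }
    return chunks;
}

var overlapping = ChunkWithOverlap("a b c d e f g h", chunkSize: 4, overlap: 2);
foreach (var piece in overlapping) Console.WriteLine(piece);
```

Overlap increases the number of model calls and can produce duplicate answers, so the combined list may need de-duplicating afterwards.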

However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and have a reusable approach that we can apply to any long document requiring entity extraction. We look forward to seeing what you can do with this!