🧪🦍 Needle in a Jungle - Information Extraction via LLMs

Notebook by Stefano Fiorucci
In this experiment, we will use Large Language Models to perform Information Extraction from textual data.
🎯 Goal: create an application that, given a text (or URL) and a specific structure provided by the user, extracts information from the source.
The "function calling" capabilities of OpenAI models unlock this task: the user can describe a structure by defining a fake function with all its typed, specific parameters. The LLM will prepare the data in this form and send it back to the user.
A nice example of using OpenAI Function Calling for information extraction is this gist by Kyle McDonald.
What is changing now is that open models with function-calling capabilities, such as Gorilla, are emerging...
Stack
- Gorilla OpenFunctions: an open-source model that formulates executable API/Function calls given natural language instructions and API/Function definition.
- Haystack: open-source LLM orchestration framework that streamlines the development of your LLM applications.
Install the dependencies
accelerate and bitsandbytes are needed to load the model in a quantized version that runs smoothly on Colab.
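A typical install cell might look like the following (package names are an assumption based on the stack described above; the notebook may pin specific versions):

```shell
pip install haystack-ai transformers accelerate bitsandbytes
```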
Load and try the model
We use the HuggingFaceLocalGenerator, which allows us to locally load a model hosted on Hugging Face.
We also specify some quantization options to run the model with the limited resources offered by Colab. For more details, see this article about the HuggingFaceLocalGenerator on the Haystack blog.
A few notes:
- Although the model is available in a free deployed version with an OpenAI-compatible API, I decided not to use this option, as I found the server rather unstable.
- To load the model on Colab, I sharded it myself and published it on Hugging Face. To understand why you need a sharded version, you can read this excellent article by Maarten Grootendorst.
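The quantization options could look something like this (a sketch: these are standard bitsandbytes-style 4-bit settings, and the notebook's exact configuration may differ):

```python
# Illustrative 4-bit quantization settings in the bitsandbytes style.
# These are assumptions, not the notebook's exact values.
quantization_config = {
    "load_in_4bit": True,               # load weights in 4-bit precision
    "bnb_4bit_use_double_quant": True,  # quantize the quantization constants too
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 data type
}

# Such a dict would typically be passed to the generator as part of
# huggingface_pipeline_kwargs={"model_kwargs": {...}} when constructing
# the HuggingFaceLocalGenerator.
```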
To understand how to prompt the model, take a look at the GitHub README. Later, we will see how to better organize the prompt for our purpose.
{'replies': [' uber.ride(loc="berkeley", type="plus", time=10)']}
All good! ✅
Prompt template and Prompt Builder
- The Prompt template to apply is model-specific. In our case, we slightly customize the original prompt available on GitHub.
- In Haystack, the prompt template is rendered using the Prompt Builder component.
{'prompt': 'USER: <<question>> Extract data from the following text. START TEXT. my fake document END TEXT. <<function>> my fake function definition\n\nASSISTANT: '}
Nice ✅
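The rendering above can be mimicked with plain Python string formatting (a minimal stand-in for illustration only: the real PromptBuilder renders Jinja2 templates, and the placeholder names here are assumptions):

```python
# A minimal stand-in for Haystack's PromptBuilder: substitute the document
# text and the function definition into the Gorilla-style prompt template.
template = (
    "USER: <<question>> Extract data from the following text. "
    "START TEXT. {document} END TEXT. <<function>> {function}\n\nASSISTANT: "
)

prompt = template.format(
    document="my fake document",
    function="my fake function definition",
)
print(prompt)
```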
Other Components
The following Components are required for the Pipeline we are about to create. Since they are simple and need no customization or individual testing, we can instantiate them directly during Pipeline creation.
- LinkContentFetcher: fetches the contents of the urls you give it and returns a list of content streams.
- HTMLToDocument: converts HTML files to Documents.
- DocumentJoiner: joins lists of Documents.
- DocumentCleaner: makes text Documents more readable.
Define a custom Component to parse and visualize the result
The output of the model generation is a function call string.
We are going to create a simple Haystack Component to appropriately parse this string and create a nice HTML visualization.
For more information on Creating custom Components, see the docs.
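The parsing step of such a component can be sketched with the standard-library ast module (an illustration of the idea, assuming the model returns a single well-formed call like the one we saw earlier; the notebook's actual component may differ):

```python
import ast

def parse_function_call(call_str: str) -> dict:
    """Parse a Gorilla-style function call string into a name and keyword args."""
    tree = ast.parse(call_str.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError(f"Not a function call: {call_str!r}")
    # Recover a dotted name such as "uber.ride"
    name = ast.unparse(call.func)
    # Keyword arguments are literals, so ast.literal_eval is safe here
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"function": name, "arguments": kwargs}

result = parse_function_call(' uber.ride(loc="berkeley", type="plus", time=10)')
```

Using ast instead of eval avoids executing arbitrary model output.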
Create an Information Extraction Pipeline
To combine the Components in an appropriate and reproducible way, we resort to Haystack Pipelines. The syntax should be easy to understand. You can find more information in the docs.
Let's draw our Pipeline!
Now we create an extract function that wraps the Pipeline.
This will accept:
- a function dict, containing the structure definition of the information we want to extract
- a url or a text, to use as data source
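The input handling of such a wrapper can be sketched as follows (names and the returned dict are illustrative; the real function runs the Haystack Pipeline rather than echoing its inputs):

```python
from typing import Optional

def extract(function: dict, url: Optional[str] = None,
            text: Optional[str] = None) -> dict:
    """Sketch of a wrapper around the extraction Pipeline."""
    # Exactly one data source must be provided
    if (url is None) == (text is None):
        raise ValueError("Provide exactly one of 'url' or 'text' as the data source")
    source = {"url": url} if url is not None else {"text": text}
    # In the real wrapper, the source is fetched/converted into Documents and the
    # function definition is injected into the prompt; here we just echo inputs.
    return {"function": function, "source": source}

out = extract({"name": "extract_animal_info"}, text="some article text")
```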
🕹️ Try our application!
Let's first define the structure to extract.
We are going to parse some articles about animals... 🦆🐻🦌
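A structure definition for animal articles might look like this, following the OpenAI/Gorilla function-calling schema (a hypothetical example: the field names and descriptions are illustrative, not the notebook's exact schema):

```python
# Hypothetical function definition describing the structure to extract.
function_definition = {
    "name": "extract_animal_info",
    "description": "Extract information about an animal from the given text",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Common name of the animal"},
            "scientific_name": {"type": "string", "description": "Latin name"},
            "habitat": {"type": "string", "description": "Where the animal lives"},
            "diet": {"type": "string", "description": "What the animal eats"},
        },
        "required": ["name"],
    },
}
```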
Let's start with an article about Capybaras
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Now let's try with some text about the Andean cock-of-the-rock (from https://www.rainforest-alliance.org/species/cock-rock/)
Now, the Yucatan Deer!
A completely different example, about AI...
⚠️ Caveats and 🔮 future directions
- I found the model a bit unstable. Changing the function definition slightly can alter the extraction results and even lead to unparsable function calls.
- It would be nice to try other similar models, such as NexusRaven.
- Good open source generic models (not fine-tuned for function calling) should also be investigated (Anyscale announcement on function calling using Mistral-7B-Instruct-v0.1)
📚 References
Related to the experiment
- Haystack LLM framework
- Using OpenAI Function Calling for Information Extraction: gist by Kyle McDonald
- Gorilla OpenFunctions: release post and GitHub repository
Other interesting resources on the topic