Information Extraction Raven
🧪🐦‍⬛ Needle in a Jungle - Information Extraction via LLMs
Notebook by Stefano Fiorucci
In this experiment, we will use Large Language Models to perform Information Extraction from textual data.
🎯 Goal: create an application that, given a URL and a specific structure provided by the user, extracts information from the source.
The "function calling" capabilities of OpenAI models unlock this task: the user describes a structure by defining a mock-up function with all its typed and specific parameters. The LLM prepares the data in this form and sends it back to the user.
A nice example of using OpenAI Function Calling for information extraction is this gist by Kyle McDonald.
What is changing now is that open models such as NexusRaven are emerging, with function calling capabilities...
This is an improved version of an older experiment that used Gorilla Open Functions.
Stack
- NexusRaven: an open-source and commercially viable function calling model that surpasses the state-of-the-art in function calling capabilities.
- Haystack: open-source LLM orchestration framework that streamlines the development of your LLM applications.
Install the dependencies
Load and try the model
We use the HuggingFaceAPIGenerator, which lets us use models hosted on Hugging Face endpoints.
In particular, we use a paid endpoint kindly provided by Nexusflow to test the LLM.
Alternative inference options:
- load the model on Colab using the HuggingFaceLocalGenerator. This is a bit impractical because the model is quite big (13B parameters) and, even with quantization, few GPU resources would be left for inference.
- local inference via TGI or vLLM: this is a good option if you have a GPU available.
- local inference via Ollama/llama.cpp: this is suitable for machines with few resources and no GPU. Keep in mind that in this case a quantized GGUF version of the model would be used, with lower quality than the original model.
To understand how to prompt the model, take a look at the Prompting notebook. Later we will see how to better organize the prompt for our purpose.
{'replies': ["Call: get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))"], 'meta': [{'model': 'http://38.142.9.20:10240', 'index': 0, 'finish_reason': 'stop_sequence', 'usage': {'completion_tokens': 29, 'prompt_tokens': 188, 'total_tokens': 217}}]}
All good! ✅
Prompt template and Prompt Builder
- The prompt template to apply is model-specific. In our case, we slightly customize the original prompt, which is available in the Prompting notebook.
- In Haystack, the prompt template is rendered using the Prompt Builder component.
{'prompt': '\nFunction:\nmy fake function definition\nUser Query: Save data from the provided text. START TEXT:my fake document END TEXT\n<human_end>'}
Nice ✅
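The rendered prompt shown above can be reproduced with a small stand-in that uses only the standard library. This is a sketch, not the notebook's actual template: Haystack's Prompt Builder renders Jinja2 templates, while this version uses string.Template. The `build_prompt` name, the newline flattening, and the `max_chars` truncation limit are assumptions added for illustration.

```python
from string import Template

# Layout copied from the rendered prompt shown above.
# $function and $document are placeholder names chosen for this sketch.
PROMPT_TEMPLATE = Template(
    "\nFunction:\n$function\n"
    "User Query: Save data from the provided text. "
    "START TEXT:$document END TEXT\n<human_end>"
)


def build_prompt(function: str, document: str, max_chars: int = 10_000) -> str:
    """Fill the template with a function definition and a source text.

    Newlines in the document are flattened and the text is truncated to
    max_chars (both are assumptions, meant to keep the prompt compact).
    """
    flat_document = document.replace("\n", " ")[:max_chars]
    return PROMPT_TEMPLATE.substitute(function=function, document=flat_document)


if __name__ == "__main__":
    print(build_prompt("my fake function definition", "my fake document"))
```

Keeping the function definition and the document as separate placeholders makes it easy to reuse the same template for different structures and different pages.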
Other Components
The following Components are required for the Pipeline we are about to create. However, they are simple and there is no need to customize and try them, so we can instantiate them directly during Pipeline creation.
- LinkContentFetcher: fetches the contents of the URLs you give it and returns a list of content streams.
- HTMLToDocument: converts HTML files to Documents.
- DocumentCleaner: makes text documents more readable.
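To give an idea of what these three steps do, here is a rough standard-library approximation. It is not how the Haystack components are implemented; the function names `fetch`, `html_to_text` and `clean` are hypothetical stand-ins, and the real components handle many more cases (content types, retries, boilerplate removal).

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def fetch(url: str) -> str:
    """Download a page and return its HTML (stand-in for LinkContentFetcher)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(html: str) -> str:
    """Strip tags and keep the text (stand-in for HTMLToDocument)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def clean(text: str) -> str:
    """Collapse runs of whitespace (stand-in for DocumentCleaner)."""
    return re.sub(r"\s+", " ", text).strip()
```

Chaining them as `clean(html_to_text(fetch(url)))` yields a single string that could be placed in the prompt as the source text.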
Define a custom Component to parse and visualize the result
The output of the model generation is a function call string.
We are going to create a simple Haystack Component to appropriately parse this string and create a nice HTML visualization.
For more information on Creating custom Components, see the docs.
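As an illustration of the parsing step, the sketch below turns a function call string into a dictionary and renders it as an HTML table. It uses plain functions rather than a Haystack Component, and the names `parse_call` and `to_html` are hypothetical; the notebook's own component may work differently.

```python
import ast
from html import escape


def parse_call(reply: str) -> dict:
    """Turn "Call: save_data(a=True, b=['x'])" into {"a": True, "b": ["x"]}."""
    source = reply.strip()
    if source.startswith("Call:"):
        source = source[len("Call:"):].strip()
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {source!r}")
    # literal_eval accepts only literals, so an argument that is itself a
    # function call (a nested call) raises ValueError instead of being run.
    return {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords if kw.arg}


def to_html(data: dict) -> str:
    """Render the extracted fields as a two-column HTML table."""
    rows = "".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in data.items()
    )
    return f"<table>{rows}</table>"


if __name__ == "__main__":
    reply = "\nCall: save_data(about_animals=True, habitat=['Andes'])"
    print(to_html(parse_call(reply)))
```

Parsing with the ast module avoids calling eval on model output: the string is only inspected, never executed.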
Create an Information Extraction Pipeline
To combine the Components in an appropriate and reproducible way, we use Haystack Pipelines. The syntax should be easy to follow. You can find more information in the docs.
This pipeline will extract the information from the given URL following the provided structure.
Now we create an extract function that wraps the Pipeline and displays the result as HTML.
This will accept:
- a function dict, containing the structure definition of the information we want to extract
- a url, to use as data source
🕹️ Try our application!
Let's first define the structure to extract.
We are going to parse some news articles about animals...
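A structure definition for the animal articles might look like the sketch below. The dictionary layout, the descriptions, and the `render_function` helper are assumptions made for illustration; only the function name save_data and the parameter names are taken from the model outputs in this notebook. The helper renders the dict as a Python-style signature with a docstring, which is one way to present a function to the model.

```python
# Hypothetical structure definition: the keys and descriptions are illustrative.
animal_spec = {
    "name": "save_data",
    "description": "Save data extracted from the source text.",
    "parameters": {
        "about_animals": {"type": "bool", "description": "Is the article about animals?"},
        "about_ai": {"type": "bool", "description": "Is the article about AI?"},
        "habitat": {"type": "List[str]", "description": "Places where the animal lives"},
        "predators": {"type": "List[str]", "description": "Animals that prey on it"},
        "diet": {"type": "List[str]", "description": "What the animal eats"},
    },
}


def render_function(spec: dict) -> str:
    """Render a spec dict as a Python function signature with a docstring."""
    params = spec["parameters"]
    signature = ", ".join(f"{name}: {info['type']}" for name, info in params.items())
    lines = [
        f"def {spec['name']}({signature}):",
        '    """',
        f"    {spec['description']}",
        "",
        "    Args:",
    ]
    for name, info in params.items():
        lines.append(f"    {name} ({info['type']}): {info['description']}")
    lines.append('    """')
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_function(animal_spec))
```

Since the model is sensitive to the order and wording of the arguments, it is worth experimenting with the descriptions when results look off.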
Let's start with an article about Capybaras
["\nCall: save_data(about_animals=True, about_ai=False, habitat=['Panama', 'Colombia', 'Venezuela', 'Guyana', 'Peru', 'Brazil', 'Paraguay', 'Northeast Argentina', 'Uruguay'], predators=['jaguars', 'caimans', 'anacondas', 'ocelots', 'harpy eagles'], diet=['vegetation', 'grass', 'grains', 'melons', 'reeds', 'squashes'])"]
Now let's try with an article about the Andean cock of the rock
["\nCall: save_data(about_animals=True, about_ai=False, habitat=['Andes'], predators=['birds of prey', 'puma', 'jaguars', 'boa constrictors'], diet=['fruit', 'insects', 'small vertebrates'])"]
Now, the Yucatan Deer!
["\nCall: save_data(about_animals=True, about_ai=False, habitat=['forests'], predators=['cougar', 'jaguar'], diet=['grass', 'leaves', 'sprouts', 'lichens', 'mosses', 'tree bark', 'fruit'])"]
A completely different example, about AI...
["\nCall: save_data(people=['Sam Altman', 'Greg Brockman', 'Bret Taylor', 'Larry Summers', 'Adam D’Angelo', 'Ilya Sutskever', 'Emmett Shear'], companies=['OpenAI', 'Microsoft', 'Thrive Capital'], summary='Sam Altman will return as CEO of OpenAI, overcoming an attempted boardroom coup that sent the company into chaos over the past several days.', topics=['OpenAI', 'Artificial intelligence', 'Machine learning', 'Computer vision', 'Natural language processing'], about_animals=False, about_ai=True)"]
["\nCall: save_data(people=['Sam Bankman-Fried'], companies=['FTX'], summary='Sam Bankman-Fried will not face second trial after multibillion-dollar crypto fraud conviction', topics=['crypto fraud', 'FTX', 'cryptocurrency exchange'], about_animals=False, about_ai=False)"]
["\nCall: save_data(people=['Michelle Toh', 'Wayne Chang', 'Jensen Huang', 'Lisa Su'], companies=['Nvidia', 'AMD'], summary='The Taiwanese American cousins going head-to-head in the global AI race', topics=['chip industry', 'global AI chip industry', 'Taiwanese descent', 'semiconductors', 'generative AI'], about_animals=False, about_ai=True)"]
✨ Conclusions and caveats
- NexusRaven seems to work much better than Gorilla Open Functions (v0) for this use case.
- I would also expect it to work significantly better than generic models to which grammars are added to make them produce JSON.
- ⚠️ When the web page is cluttered with extraneous content such as advertisements and interruptions, the model struggles to extract the relevant information and occasionally returns an empty response.
- ⚠️ As a statistical model, the LLM is highly sensitive to the prompt. For instance, changing the order or the description of the specified arguments can yield different extraction results.
References
Related to the experiment