🕵🏻 Agentic RAG with 🦙 Llama 3.2 3B
In their Llama 3.2 collection, Meta released two small yet powerful Language Models.
In this notebook, we'll use the 3B model to build an Agentic Retrieval Augmented Generation application.
🎯 Our goal is to create a system that answers questions using a knowledge base focused on the Seven Wonders of the Ancient World. If the retrieved documents don't contain the answer, the application will fall back to web search for additional context.
Stack:
- 🏗️ Haystack: open-source LLM orchestration framework that streamlines the development of your LLM applications.
- 🦙 Llama-3.2-3B-Instruct: a small but capable Language Model.
- 🦆🔎 DuckDuckGo API Websearch: retrieves search results from the Web.
Setup
Create our knowledge base
In this section, we download a dataset on the Seven Wonders of the Ancient World, enrich each document with a semantic vector and store the documents in an in-memory database.
To better understand this process, you can explore the introductory Haystack tutorial.
151
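The indexing step described in this section can be illustrated with a minimal, dependency-free sketch. It is not the notebook's code: `embed`, `cosine` and `ToyDocumentStore` are hypothetical names, the bag-of-words "embedding" is only a stand-in for a real sentence-embedding model, and the three documents are made-up examples rather than entries from the dataset.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyDocumentStore:
    """Hypothetical in-memory store keeping each text next to its vector."""

    def __init__(self):
        self.docs = []  # list of (text, vector) pairs

    def write_documents(self, texts):
        for text in texts:
            self.docs.append((text, embed(text)))
        return len(texts)  # number of documents written

    def search(self, query, top_k=5):
        query_vector = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(query_vector, d[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


store = ToyDocumentStore()
written = store.write_documents([
    "The Colossus of Rhodes was a statue of the Greek sun god Helios.",
    "The Great Pyramid of Giza is the oldest of the Seven Wonders.",
    "The Lighthouse of Alexandria stood on the island of Pharos.",
])
best = store.search("Where was the Colossus of Rhodes?", top_k=1)
print(written, best)
```

In this sketch `write_documents` returns the number of stored documents, and `search` ranks every stored vector against the query vector, which is the same idea a vector retriever applies with real embeddings.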
Load and try Llama 3.2
We will use Hugging Face Transformers to load the model on Colab.
There are plenty of other options for using open models in Haystack, including, for example, Ollama for local inference or serving with Groq (see Choosing the Right Generator).
Authorization
- you need a Hugging Face account
- you need to accept Meta's conditions here: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct and wait for the authorization
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'replies': ['\n\nThe capital of France is Paris.']}
Build the 🕵🏻 Agentic RAG Pipeline
Here's the idea:
- Perform a vector search on our knowledge base using the query.
- Pass the top 5 documents to Llama, injected into a specific prompt.
- In the prompt, instruct the model to reply with "no_answer" if it cannot infer the answer from the documents; otherwise, provide the answer.
- If "no_answer" is returned, run a web search and inject the results into a new prompt.
- Let Llama generate a final answer based on the web search results.
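The steps above can be sketched as plain control flow. This is only an illustration under stated assumptions: `agentic_rag` and the three `fake_*` helpers are hypothetical stand-ins for the retriever, the language model and the web search, so the example runs without a model or network access.

```python
NO_ANSWER = "no_answer"


def agentic_rag(query, retrieve, generate, search_web, top_k=5):
    """Answer from the knowledge base, falling back to web search."""
    documents = retrieve(query)[:top_k]
    reply = generate(query, documents, may_decline=True)
    if NO_ANSWER not in reply:
        return {"source": "knowledge base", "answer": reply}
    # The model could not answer from the documents: try the web instead.
    web_documents = search_web(query)
    answer = generate(query, web_documents, may_decline=False)
    return {"source": "web", "answer": answer}


# --- Hypothetical stand-ins (assumptions, not real components) ---
def fake_retrieve(query):
    return ["The Colossus of Rhodes was a statue of the sun god Helios."]


def fake_search_web(query):
    return ["Munich is the capital of the federal state of Bavaria."]


def fake_generate(query, documents, may_decline):
    # "Answers" with the first document sharing a longer word with the query.
    words = {w.strip("?.,!").lower() for w in query.split()}
    keywords = {w for w in words if len(w) > 4}
    for doc in documents:
        if any(k in doc.lower() for k in keywords):
            return doc
    return NO_ANSWER if may_decline else "I could not find an answer."


kb_result = agentic_rag("Who did the Colossus depict?", fake_retrieve, fake_generate, fake_search_web)
web_result = agentic_rag("Where is Munich?", fake_retrieve, fake_generate, fake_search_web)
print(kb_result["source"], "|", web_result["source"])
```

The first query is answered from the toy knowledge base, while the second triggers the fallback because the stand-in model replies with "no_answer".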
For a detailed explanation of a similar use case, take a look at this tutorial: Building Fallbacks to Websearch with Conditional Routing.
Retrieval part
Let's initialize the components to use for the initial retrieval phase.
Prompt template
Let's define the first prompt template, which instructs the model to:
- answer the query based on the retrieved documents, if possible
- reply with 'no_answer', otherwise
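As a rough illustration of such a template, here is a plain-Python sketch. Haystack's PromptBuilder works with Jinja templates; the function below only mimics the idea with string formatting, and both the name `build_prompt` and the exact wording of the instructions are assumptions, not the notebook's template.

```python
def build_prompt(query, documents):
    """Render a prompt that asks for an answer or the 'no_answer' marker."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer the following query given the documents.\n"
        "If the answer is not contained within the documents, "
        "reply with 'no_answer'.\n\n"
        f"Query: {query}\n"
        f"Documents:\n{context}\n"
    )


prompt = build_prompt(
    "Who was the Great Pyramid of Giza built for?",
    ["The Great Pyramid of Giza was built as the tomb of Fourth Dynasty pharaoh Khufu."],
)
print(prompt)
```

Keeping the fallback marker as a fixed string makes the model's reply easy to check in the routing step that follows.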
Conditional Router
This is the component that will perform data routing, depending on the reply given by the Language Model.
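The routing decision itself is small enough to sketch in plain Python. The function below is a hypothetical stand-in, not Haystack's ConditionalRouter (which is configured with a list of routes, each carrying a condition and an output name); the output names `answer` and `go_to_websearch` are taken from the pipeline connections in this notebook.

```python
def route(replies, query):
    """Send the query to web search if the model declined, else pass the answer on."""
    reply = replies[0]
    if "no_answer" in reply:
        return {"go_to_websearch": query}
    return {"answer": reply}


answered = route(["this is the answer!"], "my query")
declined = route(["no_answer"], "my query")
print(answered)
print(declined)
```

Checking for the marker with `in` rather than strict equality is a design choice of this sketch: it tolerates extra whitespace or text that a small model may add around "no_answer".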
{'answer': 'this is the answer!'}
{'go_to_websearch': 'my query'}
Web search
Found documents:
Content: Tanzania is a country in East Africa within the African Great Lakes region. It is bordered by Uganda, Kenya, the Indian Ocean, Mozambique, Malawi, Zambia, Rwanda, Burundi, and the Democratic Republic of the Congo.
Content: Tanzania is a country in East Africa's Great Lakes Region, located just below the Equator. It is bordered by eight countries and the Indian Ocean, and has diverse geographical features such as mountains, lakes, rivers, and islands.
Content: Tanzania is an East African country formed by the union of Tanganyika and Zanzibar in 1964. It has diverse landscapes, including Mount Kilimanjaro, Lake Victoria, and the Great Rift Valley, and a rich cultural heritage.
Content: Tanzania is the largest and most populous country in East Africa, with a total area of 947,300 sq km and a coastline of 1,424 km. It has diverse natural features, including mountains, lakes, rivers, and islands, and borders eight other countries.
Content: Tanzania is a country in Eastern Africa, bordering the Indian Ocean, between Kenya and Mozambique. It has many lakes, national parks, and mountains, including Mount Kilimanjaro, the highest point in Africa.
Search Links:
https://en.wikipedia.org/wiki/Tanzania
https://www.worldatlas.com/maps/tanzania
https://www.britannica.com/place/Tanzania
https://www.cia.gov/the-world-factbook/countries/tanzania/
https://en.wikipedia.org/wiki/Geography_of_Tanzania
Prompt template after Web search
Assembling the Pipeline
Now that we have all the components, we can assemble the full pipeline.
To handle the different prompt sources, we'll use a BranchJoiner. This allows us to connect multiple output sockets (with prompts) to our language model. In our case, the prompt will either come from the initial prompt_builder or from prompt_builder_after_websearch.
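To make the role of the joiner concrete, here is a minimal sketch of the idea, assuming a hypothetical `ToyBranchJoiner` class. It is not Haystack's BranchJoiner implementation: it only shows how a single downstream input can be fed by whichever of two branches produced a prompt in the current step.

```python
class ToyBranchJoiner:
    """Forward the one value received, regardless of which branch sent it."""

    def run(self, **branches):
        received = [value for value in branches.values() if value is not None]
        if len(received) != 1:
            raise ValueError(f"expected exactly one input, got {len(received)}")
        return {"value": received[0]}


joiner = ToyBranchJoiner()

# First pass: only the initial prompt builder has produced a prompt.
first = joiner.run(
    prompt_builder="prompt with retrieved documents",
    prompt_builder_after_websearch=None,
)

# Second pass (after a 'no_answer'): only the web-search prompt is present.
second = joiner.run(
    prompt_builder=None,
    prompt_builder_after_websearch="prompt with web results",
)
print(first["value"], "|", second["value"])
```

Because the language model always reads from the joiner's single output, the loop back from the web-search branch needs no second model instance.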
<haystack.core.pipeline.pipeline.Pipeline object at 0x7cd028903ca0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - prompt_joiner: BranchJoiner
  - llm: HuggingFaceLocalGenerator
  - router: ConditionalRouter
  - websearch: DuckduckgoApiWebSearch
  - prompt_builder_after_websearch: PromptBuilder
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> prompt_joiner.value (str)
  - prompt_joiner.value -> llm.prompt (str)
  - llm.replies -> router.replies (List[str])
  - router.go_to_websearch -> websearch.query (str)
  - router.go_to_websearch -> prompt_builder_after_websearch.query (str)
  - websearch.documents -> prompt_builder_after_websearch.documents (List[Document])
  - prompt_builder_after_websearch.prompt -> prompt_joiner.value (str)
Agentic RAG in action!
FROM THE KNOWLEDGE BASE: The Great Pyramid of Giza was built as the tomb of Fourth Dynasty pharaoh Khufu, and its construction is believed to have taken around 27 years to complete.
FROM THE WEB: Munich is located in the south of Germany, and is the capital of the federal state of Bavaria. It is connected to other major cities in Germany and Austria, and has direct access to Italy.
FROM THE KNOWLEDGE BASE: The head of the Colossus of Rhodes was of a standard rendering at the time, with curly hair and evenly spaced spikes of bronze or silver flame radiating from it, similar to the images found on contemporary Rhodian coins.
FROM THE WEB: No, the Leaning Tower of Pisa was one of the Seven Wonders of the Medieval World, but not of the ancient world.
FROM THE KNOWLEDGE BASE: Muawiyah I was a Muslim general who conquered Rhodes in 653.
(Notebook by Stefano Fiorucci)