deepset Gemma Chat Rag

Gemma Chat Rag

agentic-aiagenticagentsgenaiAIhaystack-cookbookgenai-usecaseshaystack-ainotebooksPythonragai-tools

alph-notebooks/haystack-cookbook / gemma_chat_rag.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Build with Gemma and Haystack

Gemma 4 is a family of great open models by Google.

Some interesting facts:

4 sizes (E2B, E4B, 26B4A, 31B)
Reasoning models
Support for tool calling and structured outputs
Multimodal
Apache 2.0 license
Up to 256K context window
Trained on 140+ languages

We will see some examples of what we can build with Gemma 4 and the Haystack LLM framework: from RAG to multimodal agents.

Installation

First of all, to have the model running fast enough, you need to Enable GPU Runtime in Colab.

Let's install the Python packages we need.

[ ]

In this notebook we'll use Ollama to serve Gemma 4 E4B. So we install it and download the model.

[12]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
pciutils is already the newest version (1:3.7.0-6).
zstd is already the newest version (1.4.8+dfsg-3build1).
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.
>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading ollama-linux-amd64.tar.zst
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
ready

RAG with Gemma (about Rock music) 🎸

Why not start with some classic RAG? Let's see how this model performs.

In this notebook we'll use Ollama to serve a quantized version of Gemma 4 E4B. This model has 4.5B effective parameters (8B with embeddings) and is designed for on-device use.

There are also other options available to serve this model, like Transformers or llama.cpp. For production workloads, vLLM is recommended.

All of these options are supported in Haystack.

Load data from Wikipedia

[13]

Indexing Pipeline

[14]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

{'writer': {'documents_written': 662}}

[15]

{'title': 'Audioslave',
, 'url': 'https://en.wikipedia.org/wiki/Audioslave',
, 'source_id': 'ab025d6385ab46b73dcf198e52d084f6e62f0c210ea3ecca1ce317d4ac21858c',
, 'page_number': 1,
, 'split_id': 0,
, 'split_idx_start': 0}

RAG Pipeline

We set up the OllamaChatGenerator with think=False: RAG doesn't require extended reasoning here, so we disable thinking mode to keep answers fast. We also pass print_streaming_chunk as the streaming callback, a helper that prints tokens to stdout as they arrive.

[16]

<haystack.core.pipeline.pipeline.Pipeline object at 0x7afae4edcd10>
,🚅 Components
,  - retriever: InMemoryBM25Retriever
,  - prompt_builder: ChatPromptBuilder
,  - llm: OllamaChatGenerator
,🛤️ Connections
,  - retriever.documents -> prompt_builder.documents (list[Document])
,  - prompt_builder.prompt -> llm.messages (list[ChatMessage])

[ ]

[18]

[ASSISTANT]
The guest vocalist on Dire Straits' hit song "Money for Nothing" was Sting.

Source: https://en.wikipedia.org/wiki/Dire_Straits

[19]

[20]

Who was the lead singer of Audioslave?
[ASSISTANT]
The lead singer of Audioslave was Chris Cornell. Source: https://en.wikipedia.org/wiki/Audioslave

📷 Image Question Answering

Time to test the multimodal capabilities of this model.

Let's download the "Scaling Instruction-Finetuned Language Models" paper by Google and ask a question about a plot.

This time we'll use the model in thinking mode.

[ ]

[22]

[23]

[REASONING]
Thinking Process:

1.  **Analyze the Request:** The user wants to know the main takeaway of Figure 6, and the answer must be brief and accurate.
2.  **Analyze Figure 6 (The Image):**
    *   **Title/Caption:** "Figure 6: Zero-shot performance of PaLM and Flan-PaLM on a set of 23 challenging BIG-Bench tasks (BBH). Flan-PaLM benefits from chain-of-thought (CoT) generation activated via 'let's think step-by-step.'"
    *   **Axes:** Y-axis is BBH accuracy (%). X-axis shows different model sizes/conditions (8B, 62B, 540B).
    *   **Legend/Data Series:**
        *   Blue (PaLM: Zero-shot)
        *   Orange (PaLM: Zero-shot + CoT)
        *   Light Blue (Flan-PaLM: Zero-shot)
        *   Dark Blue (Flan-PaLM: Zero-shot + CoT)
    *   **Observe Trends:**
        *   *PaLM vs Flan-PaLM:* Flan-PaLM generally performs better than PaLM.
        *   *Zero-shot vs + CoT:* Adding CoT significantly boosts performance for *both* models (compare Blue to Orange, and Light Blue to Dark Blue).
        *   *Magnitude of Improvement:* The most noticeable gains are seen when Flan-PaLM is combined with CoT (Dark Blue bars).
3.  **Analyze the Surrounding Text (Contextualizing the Takeaway):**
    *   The text discusses CoT helping models perform better, especially for complex tasks.
    *   It specifically mentions that "Flan-PaLM benefits from CoT generation activated via 'let's think step-by-step.'"
4.  **Synthesize the Main Takeaway:**
    *   The primary finding is the effectiveness of Chain-of-Thought (CoT) prompting.
    *   *Which model benefits the most?* Flan-PaLM.
    *   *What is the core mechanism?* CoT prompting significantly improves zero-shot performance on challenging tasks (BBH).
5.  **Draft the Answer (Focusing on brevity and accuracy):** Flan-PaLM substantially improves zero-shot performance on challenging tasks (BBH) when Chain-of-Thought (CoT) prompting is used, outperforming both PaLM and the non-CoT versions of Flan-PaLM. (This is accurate but slightly long.)
6.  **Refine the Answer (Making it concise):** The main takeaway is that Chain-of-Thought (CoT) prompting significantly boosts the zero-shot performance of Flan-PaLM on complex tasks (BBH), making it superior to both base PaLM and Flan-PaLM models lacking CoT. (This is precise and brief.)

[ASSISTANT]
The main takeaway is that **Chain-of-Thought (CoT) prompting significantly boosts the zero-shot performance of Flan-PaLM** on complex tasks (BBH), demonstrating that providing the model with step-by-step reasoning ability is crucial for superior performance compared to base PaLM or Flan-PaLM models without CoT.

Nice! To better understand Haystack's multimodal capabilities, these tutorials are recommended:

⛅ Multimodal Weather Agent

We give Gemma a get_weather tool and a map image: the agent should identify the city from the image, then call the tool to retrieve current conditions.

A Haystack Agent is a LLM/ChatGenerator wrapped in an automatic tool-use loop:it executes the tools the model requests and feeds results back until the model produces a final answer.

[ ]

[26]

[27]

[28]

[REASONING]
The user wants me to perform three tasks based on the provided map:
1. Detect the city represented in the map.
2. Find the top 3 places to visit in that city.
3. Find the current weather for that city and include a brief description of it.

I have access to a tool: `get_weather(location: str)`.

**Step 1: Detect the city.**
I need to analyze the map image. The map shows a river (labeled "ARNO") and several prominent historical structures, including what looks like the Duomo (Florence Cathedral) and other major landmarks typical of Italy. A quick visual identification points strongly to Florence, Italy.

**Step 2: Find top 3 places to visit.**
Since the city is Florence, I can list 3 well-known attractions:
* The Duomo (Florence Cathedral)
* Ponte Vecchio
* Uffizi Gallery / Accademia Gallery (or Pitti Palace)

**Step 3: Get the weather.**
I need to use the `get_weather` tool with the location "Florence".

**Plan:**
1. Call `get_weather` for "Florence".
2. Wait for the tool output.
3. Compile the final response incorporating the city name, top 3 places, and the weather description.

[TOOL CALL]
Tool: get_weather
Arguments: {"location": "Florence"}

[TOOL RESULT]
{'description': 'Patchy rain nearby', 'temperature': 24, 'humidity': 47, 'precipitation': 0.0, 'wind_speed': 13, 'wind_direction': WindDirection.NORTH_NORTHEAST}

[ASSISTANT]
The city represented in the map is **Florence**, Italy.

**Top 3 Places to Visit:**
1. **Florence Cathedral (Duomo):** A masterpiece of Renaissance architecture, famous for Brunelleschi's dome.
2. **Ponte Vecchio:** A medieval bridge spanning the Arno River, historically known for its shops built along it.
3. **Uffizi Gallery:** One of the most important art museums in the world, housing incredible Renaissance paintings.

**Current Weather:**
The weather in Florence is currently **Patchy rain nearby**. The temperature is **24°C**, with a humidity of 47%. The wind speed is 13 km/h coming from the North-Northeast.

🔎💻 GitHub Search Agent

Now we'll build a GitHub search agent by connecting Gemma to the official GitHub MCP server. MCP (Model Context Protocol) is a standard way for LLMs to discover and call external tools.

To run this experiment, you'll need to pass a GitHub Personal Access Token with read permissions.

[ ]

[30]

[38]

The GitHub MCP server exposes many tools, and their schemas alone can easily exceed Ollama's default 4k context on Colab. As a first attempt, we bump num_ctx to 8192.

Another approach would be to hand-pick only the relevant tools, but we'd rather have the agent figure it out on its own.

If you're running on a powerful laptop, you should be able to use the model without encountering these problems.

[41]

[44]

[TOOL CALL]
Tool: search 
Arguments: {"query": "user:janekakle github"}

ERROR:haystack.components.tools.tool_invoker:Tool 'search' not found. Available tools: add_comment_to_pending_review, add_issue_comment, add_reply_to_pull_request_comment, assign_copilot_to_issue, create_branch, create_or_update_file, create_pull_request, create_pull_request_with_copilot, create_repository, delete_file, fork_repository, get_commit, get_copilot_job_status, get_file_contents, get_label, get_latest_release, get_me, get_release_by_tag, get_tag, get_team_members, get_teams, issue_read, issue_write, list_branches, list_commits, list_issue_types, list_issues, list_pull_requests, list_releases, list_tags, merge_pull_request, pull_request_read, pull_request_review_write, push_files, request_copilot_review, run_secret_scanning, search_code, search_issues, search_pull_requests, search_repositories, search_users, sub_issue_write, update_pull_request, update_pull_request_branch

[ASSISTANT]
What do you want to do? I can help with tasks like:

* **Reading code:** Finding information about repositories, issues, pull requests, etc.
* **Interacting with code:** Creating files, committing changes, creating issues, opening pull requests.
* **Searching/Browsing:** Searching for code or repositories.

Please specify your goal or the repository/context you are working with!

[45]

[GIN] 2026/04/16 - 15:35:43 | 200 | 10.336574493s |       127.0.0.1 | POST     "/api/chat"
time=2026-04-16T15:35:49.667Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=8192 prompt=9233 keep=4 new=8192
[GIN] 2026/04/16 - 15:35:55 | 200 | 12.062062297s |       127.0.0.1 | POST     "/api/chat"
time=2026-04-16T15:36:33.791Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=8192 prompt=9297 keep=4 new=8192
time=2026-04-16T15:36:39.443Z level=WARN source=gemma4.go:319 msg="gemma4 tool call flush on done failed" error="expected '{' in tool call" content="call:search_code(\"user:karpathy projects\")\n"
[GIN] 2026/04/16 - 15:36:39 | 200 |  10.92845609s |       127.0.0.1 | POST     "/api/chat"
time=2026-04-16T15:36:59.274Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=8192 prompt=9297 keep=4 new=8192
[GIN] 2026/04/16 - 15:37:05 | 200 | 10.722681345s |       127.0.0.1 | POST     "/api/chat"
time=2026-04-16T15:37:13.018Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=8192 prompt=9269 keep=4 new=8192
[GIN] 2026/04/16 - 15:37:20 | 200 | 15.361395497s |       127.0.0.1 | POST     "/api/chat"

Oh no!

The prompt got truncated, so the model lost part of the tool catalog and ended up hallucinating a non-existent search tool.

Time to try Haystack's SearchableToolset, a class that lets agents dynamically discover tools from large catalogs via keyword-based search. Tool schemas are loaded on demand, so the context stays lean.

[46]

[47]

[REASONING]
The user is asking for the "5 most popular projects by karpathy user" and their focus, requesting a very short answer.

I need to determine if I have a tool that can access this kind of information (e.g., GitHub profile analysis, popular projects list, specific user statistics).

Since I don't have any initial tools loaded, I must use the `search_tools` tool first to see what capabilities are available. The key entities are "karpathy", "projects", and "popular".

I will use "karpathy", "projects", and "github" as keywords.

[TOOL CALL]
Tool: search_tools
Arguments: {"k": 3, "tool_keywords": "karpathy github projects"}

[TOOL RESULT]
Found and loaded 3 tool(s): search_repositories, list_branches, list_releases. Use them directly as tools.

[REASONING]
The search tools revealed that I have `search_repositories`, `list_branches`, and `list_releases`. These tools are related to GitHub operations.

To find the "5 most popular projects by karpathy user," the most appropriate tool is `search_repositories`, as it allows searching for repositories by user, popularity (sorting by stars), and keywords.

The user specified "by karpathy user." I will assume "karpathy" is the username.

I will use `search_repositories` with:
1. `query`: `user:karpathy` (to filter by the user)
2. `sort`: `stars` (to find the "most popular")
3. `perPage`: 5 (to get the top 5)

I will execute this plan.

[TOOL CALL]
Tool: search_repositories
Arguments: {"query": "user:karpathy", "sort": "stars", "perPage": 5}

[TOOL RESULT]
{"meta":null,"content":[{"type":"text","text":"{\"total_count\":54,\"incomplete_results\":false,\"items\":[{\"id\":1174820787,\"name\":\"autoresearch\",\"full_name\":\"karpathy/autoresearch\",\"description\":\"AI agents running research on single-GPU nanochat training automatically\",\"html_url\":\"https://github.com/karpathy/autoresearch\",\"language\":\"Python\",\"stargazers_count\":73206,\"forks_count\":10669,\"open_issues_count\":181,\"updated_at\":\"2026-04-16T15:36:48Z\",\"created_at\":\"2026-03-06T22:00:43Z\",\"private\":false,\"fork\":false,\"archived\":false,\"default_branch\":\"master\"},{\"id\":582822129,\"name\":\"nanoGPT\",\"full_name\":\"karpathy/nanoGPT\",\"description\":\"The simplest, fastest repository for training/finetuning medium-sized GPTs.\",\"html_url\":\"https://github.com/karpathy/nanoGPT\",\"language\":\"Python\",\"stargazers_count\":56738,\"forks_count\":9705,\"open_issues_count\":343,\"updated_at\":\"2026-04-16T15:21:16Z\",\"created_at\":\"2022-12-28T00:51:12Z\",\"private\":false,\"fork\":false,\"archived\":false,\"default_branch\":\"master\"},{\"id\":1075431749,\"name\":\"nanochat\",\"full_name\":\"karpathy/nanochat\",\"description\":\"The best ChatGPT that $100 can buy.\",\"html_url\":\"https://github.com/karpathy/nanochat\",\"language\":\"Python\",\"stargazers_count\":51962,\"forks_count\":6905,\"open_issues_count\":92,\"updated_at\":\"2026-04-16T15:36:58Z\",\"created_at\":\"2025-10-13T13:46:35Z\",\"private\":false,\"fork\":false,\"archived\":false,\"default_branch\":\"master\"},{\"id\":806315903,\"name\":\"LLM101n\",\"full_name\":\"karpathy/LLM101n\",\"description\":\"LLM101n: Let's build a Storyteller\",\"html_url\":\"https://github.com/karpathy/LLM101n\",\"stargazers_count\":36787,\"forks_count\":2010,\"open_issues_count\":19,\"updated_at\":\"2026-04-16T15:29:21Z\",\"created_at\":\"2024-05-27T00:23:38Z\",\"private\":false,\"fork\":false,\"archived\":true,\"default_branch\":\"master\"},{\"id\":783833344,\"name\":\"llm.c\",\"full_name\":\"karpathy/llm.c\",\"description\":\"LLM training in simple, raw C/CUDA\",\"html_url\":\"https://github.com/karpathy/llm.c\",\"language\":\"Cuda\",\"stargazers_count\":29595,\"forks_count\":3528,\"open_issues_count\":218,\"updated_at\":\"2026-04-16T15:20:01Z\",\"created_at\":\"2024-04-08T16:58:11Z\",\"private\":false,\"fork\":false,\"archived\":false,\"default_branch\":\"master\"}]}","annotations":null,"meta":null}],"structuredContent":null,"isError":false}

[REASONING]
I have successfully retrieved the 5 most starred repositories for the user `karpathy`.

The results are:
1. **autoresearch**: AI agents running research on single-GPU nanochat training automatically (Python)
2. **nanoGPT**: The simplest, fastest repository for training/finetuning medium-sized GPTs. (Python)
3. **nanochat**: The best ChatGPT that $100 can buy. (Python)
4. **LLM101n**: LLM101n: Let's build a Storyteller (N/A, but the description is provided)
5. **llm.c**: LLM training in simple, raw C/CUDA (Cuda)

The user requested a "Very short answer" listing the 5 projects and their focus. I can compile this information directly from the `description` field provided in the API response.

Plan: Format the extracted data into a concise list.

[ASSISTANT]
The 5 most popular projects by karpathy, and their focus, are:

1. **autoresearch**: AI agents running research on single-GPU nanochat training automatically.
2. **nanoGPT**: The simplest, fastest repository for training/finetuning medium-sized GPTs.
3. **nanochat**: A highly capable local ChatGPT implementation.
4. **LLM101n**: A project focused on building a Storyteller using large language models.
5. **llm.c**: Implementing LLM training using raw C/CUDA for efficiency.

Well done! 🎉

In this notebook we built four things with Gemma 4 and Haystack: a RAG pipeline over Wikipedia, an image QA pipeline on a PDF plot, a multimodal weather agent that reads a map, and a GitHub agent that discovers its tools on demand via MCP.

For more, check the Haystack docs, tutorials and cookbook.

(Notebook by Stefano Fiorucci)