Query 1

alph-notebooks/arize-prompt-learning / query_1.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

[1]

/opt/anaconda3/envs/hover-benchmark/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[ ]

[3]

[4]

[5]

[ ]

[7]

[8]

[9]

[10]

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:39<00:00 |  3.77it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:17<00:00 |  8.79it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.60it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.97it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:32<00:00 |  4.57it/s

f1: 0.658888222888223, em: 0.5866666666666667, prec: 0.6722433862433863, recall: 0.6628232323232323

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:34<00:00 |  4.32it/s

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 3 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 1/3: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 2/3: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 3/3: Optimized
NEW PROMPTS:

create_query_1_prompt = You are generating a single keyword-style search query for a BM25 retriever over Wikipedia abstracts. Use {question}.
- Identify: (a) target entity/answer type (person/date/number/title/place/list/yes-no), (b) key entities and relations (author-of, member-of, birthplace, cast/role, founded year), (c) any constraints (timeframe, medium/type like song/album/film/novel/city/league, country).
- Disambiguate aggressively:
  - Put exact titles in quotes; add medium/type words and years if implied (e.g., "The Other Side" song vs album; "A Death in Vienna" novel).
  - For people with common names, add role or affiliation (e.g., actor Tombstone, boxer).
  - For places, add state/country or league/division.
- Multi-hop awareness:
  - If the question asks about a property of an entity that must first be identified (e.g., author of a work, team of a player), include both the known entity and the relation in one query (e.g., '"A Death in Vienna" novel author').
  - For comparisons or “which/what” between multiple entities, include all entities plus the property in one query.
- Use relation keywords: author, creator, cast, role, birth date, birthplace, population 2010 census, founded year, member count, creators, owners, address.
- Prefer franchise or continuity disambiguators when relevant (e.g., Disney, sequel, series, league).
- Do not output code or pseudo-code. Output exactly one plain-text query string only (no extra punctuation besides quotes).
Guidelines (examples, do not output these):
- Q: Who wrote the novel that the film City of Ember is based on? -> Query: "City of Ember" film novel author Jeanne DuPrau
- Q: Which magazine was published first, Harpies and Quines or Harper's Bazaar? -> Query: "Harpies and Quines" first issue year vs "Harper's Bazaar" 1867 magazine
- Q: The Other Side was released by the band founded in what year? -> Query: "The Other Side" song band founding year OR "The Other Side" album 1927 band founding year (include medium)
- Q: How many novels has the author of "A Death in Vienna" written? -> Query: "A Death in Vienna" novel author bibliography number of novels
- Q: Which actor from 'Tombstone' was born May 17, 1955? -> Query: "Tombstone" film cast actor born May 17, 1955
- Q: Which brothel originally known as Mustang Bridge Ranch is in Clark, Nevada? -> Query: Clark Nevada brothel "Mustang Bridge Ranch" Mustang Ranch
- Q: What is the address of the theatre near the Old Post Office Block in Manchester? -> Query: Palace Theatre Manchester New Hampshire address "Hanover Street"

summarize_1_prompt = Read {passages_1} and extract only facts that help answer {question}.
- Write concise bullets as subject–verb–object triples. Prefix each bullet with the source page title in brackets. Include medium/type and years where relevant, e.g., [The Other Side (1927 album)] is a 1990 album by 1927 (band).
- Preserve exact titles, names, dates, numbers; include aliases if present.
- Explicitly note ambiguities or multiple candidates and how they differ (e.g., song vs album; different authors with same title).
- Prefer authoritative sources (page title exactly matches the entity) but list all candidates if uncertain.
- If the question implies constraints (timeframe, franchise/continuity, jurisdiction), add a bullet:
  - Question constraints: <state any constraints inferred from the question (e.g., medium, year, timeframe like “during 1927, 1930”, franchise like Disney, census year 2010)>
- End with two lines:
  - Known entities: <exact titles/names identified so far (include medium/type)>
  - Next-hop focus: <specific missing link to look up next (e.g., "identify author of 'A Death in Vienna' (novel)", "Boston Red Sox World Series wins during 1927 or 1930", "Baloo second Disney film production year", "address of Palace Theatre Manchester NH")>
- Do not speculate or add unrelated background.

create_query_2_prompt = Use {question} and {summary_1} to generate a single improved search query.
- Use exact strings from Known entities; directly target the gap in Next-hop focus.
- Add strong disambiguators: medium/type (song/album/film/novel/league/city), franchise (Disney), roles (author/cast), locations (city/state/country), and years/timeframes mentioned under Question constraints.
- For comparisons, include all entities plus the property (e.g., "first issue year", "older than").
- For canonical properties (member count, founding year, population 2010), query the exact entity page plus the property keywords (e.g., "Boston Red Sox" World Series championships; "Ivy League" number of schools).
- If ambiguity remains, include alternative titles/aliases in quotes in the same query.
- Do not output code/SQL. Output exactly one plain-text query string only.
Guidelines (examples, do not output these):
- Next-hop: EastEnders creators -> Query: "EastEnders" creators Julia Smith Tony Holland
- Next-hop: identify '71 actor and Peaky Blinders role -> Query: "'71" film cast Peaky Blinders role Arthur Shelby Paul Anderson
- Next-hop: Baloo second Disney movie production year -> Query: Baloo character Disney "The Jungle Book 2" 2003
- Next-hop: owner/operator of Las Vegas Convention Center -> Query: "Las Vegas Convention Center" owner operator "Las Vegas Convention and Visitors Authority"
- Next-hop: author of "A Death in Vienna" (novel) -> Query: "A Death in Vienna" novel author Daniel Silva bibliography number of novels

summarize_2_prompt = Read {passages_2} in light of {question} and {summary_1}.
- Compile final, question-focused bullets with exact names/dates/titles/numbers. Prefix each bullet with the source page title in brackets.
- Resolve ambiguity by preferring the most specific, authoritative passage (exact entity page or the page for the property itself). Note conflicts briefly and state which source is most direct/specific and why (e.g., city page for population, film page for cast/role, award page for recipients).
- Ensure constraints from Question constraints are satisfied (e.g., timeframe during a player’s tenure, Disney continuity, census year).
- If multiple candidates remain, list them with distinguishing details.
- End with one line: Answer candidate(s): <verbatim minimal candidate value(s) complying with constraints>.
- Produce only the summary.

final_answer_prompt = Given {question}, {summary_1}, {summary_2}, output the minimal, precise answer only.
- Determine the answer type (yes/no, person, date, number, title, place, list) and extract it verbatim from the summaries.
- Answer only if the exact value appears explicitly and matches any constraints (timeframe, medium/franchise, location). If evidence conflicts or is missing after two hops, output "Unknown".
- When multiple candidates exist but one is best supported by the most authoritative, most specific source (per summarize_2), choose that one.
- For yes/no, output "yes" or "no" only. For names, dates, numbers, titles, or lists, output only that content (comma-separated for lists) with original capitalization/spelling.
- Prefer exact phrasing expected by the question (e.g., "Mustang Ranch", "August 25, 1867"). For counts, output the numeral as stated in the selected passage that satisfies constraints.
- Do not include any extra words, punctuation, or explanations.

llm_generate |█████████▉| 299/300 (99.7%) | ⏳ 00:42<00:00 | 10.44it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:38<00:00 |  7.82it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.74it/s
                                                                       
                                                                       
llm_generate |██████████| 300/300 (100.0%) | ⏳ 02:33<00:00 | 10.44it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...


































































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.52it/s

f1: 0.5788519791755087, em: 0.47333333333333333, prec: 0.5767152657284238, recall: 0.6198631553631554

llm_generate |██████████| 300/300 (100.0%) | ⏳ 03:10<00:00 |  1.58it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 01:16<00:00 |  3.94it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.67it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:18<00:00 |  8.24it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:19<00:00 |  7.75it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:21<00:00 |  6.99it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.61it/s

f1: 0.6935193325193325, em: 0.58, prec: 0.6891111111111111, recall: 0.7247222222222222

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:36<00:00 |  4.10it/s

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 3 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 1/3: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 2/3: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 3/3: Optimized
create_query_1_prompt = You are generating a single high-recall, keyword-style search query for a BM25 retriever over Wikipedia abstracts. Use {question}.
- Objective: write one entity-rich query that surfaces the core entities AND the likely pivot pages (exact entity page, list/filmography/discography/roster/awards pages, cast/role pages, founding/formation pages) needed to answer the question.
- First, identify: (a) answer type (person/date/number/title/place/list/yes-no/category/comparison), (b) all key entities and relations (author-of, creator-of, member-of, birthplace, award/award category, founding/charter year, publication/formation year, cast/role, nickname/alias, original/birth name), (c) constraints (timeframe/year, medium/type such as film/novel/album/magazine/comic/video game/city/league, jurisdiction/country/state/council, franchise/continuity).
- Disambiguate aggressively with exact strings and type words:
  - Put exact titles/names in quotes; add medium/type in parentheses where canonical (e.g., Editors (band), "Der Wildschütz" opera, "Peak FM (North Derbyshire)" radio station).
  - For ambiguous names or punctuation-heavy titles, include normalized OR-variants in quotes within the same query: e.g., '"OK! (magazine)" OR "OK magazine"'; include accented/unaccented or punctuation-stripped variants when helpful.
  - For people with common names, add role/affiliation (actor, footballer, DJ, cosmonaut) and one notable work in quotes.
  - For works with common titles, add medium and artist/franchise (e.g., '"The Other Side" song OR single artist band').
- Recall-first multi-hop strategy in one query:
  - If the answer depends on a property of a second entity, include both the pivot entity and the property, and target overview/list pages: discography, filmography, roster, list of X, award pages, founding pages.
  - Include strong synonyms and near-variants to avoid over-constraining: album/compilation/EP/single; founded/established/formation year; birthplace/place of birth/hometown; publisher/owner/operator; first issue year/publication year/release date; headquarters city/head office.
  - For comparisons (“which/what … first/older/founded”), include all entities plus the property and comparison keywords: earlier than/older than/first issue year/published first.
  - For timelines (“worked before/joined from”), include the person + “previous club” OR “prior to” OR “before” + the year/season + role (coach/manager).
  - For counts/how many, include the canonical entity and “number of” or “how many” plus the container entity keyword (e.g., schools, albums, members).
  - For location questions (“where/which city/country”), include the entity name plus “country” OR “located in” OR “headquarters city”.
- Authority targeting and breadth control:
  - Prefer exact entity pages, then official list/award/roster pages, then overview pages. If ambiguity remains, include multiple plausible candidates joined by OR in a single query.
- Do not output code or explanations. Output exactly one plain-text query string only (no trailing punctuation beyond quotes).
Guidelines (examples, do not output these):
- Q: Which was published first, The Fader or OK!? -> Query: '"The Fader" magazine first issue year OR founding year vs "OK! (magazine)" OR "OK magazine" first issue year'
- Q: Who was born first, Alex Segal or Mika Kaurismäki? -> Query: '"Alex Segal" birth year OR born vs "Mika Kaurismäki" born 1955 birth year'
- Q: The Other Side was released by the band founded in what year? -> Query: '"The Other Side" song OR single artist band founding year formation year'
- Q: Which actor from '71 also played Arthur Shelby in Peaky Blinders? -> Query: '"\'71" (film) cast actor "Arthur Shelby" "Peaky Blinders" Paul Anderson'
- Q: Panathinaikos F.C.’s new coach worked where before? -> Query: '"Henk ten Cate" managerial career previous club prior to 2008 Panathinaikos coach before 2008'


summarize_1_prompt = Read {passages_1} and extract only facts that help answer {question}.
- Write concise bullets as subject–verb–object triples. Prefix each bullet with the source page title in brackets. Include medium/type and years where relevant (e.g., [The Other Side (1927 album)] is a 1990 album by 1927 (band)).
- Preserve exact titles, names, dates, numbers; include aliases if present (birth/original names, nicknames). Include both punctuated and normalized forms where relevant (e.g., OK! (magazine) / OK magazine).
- For each bullet, indicate authority in parentheses at the end: (exact entity page / official list/award/roster page / overview page / unrelated).
- Explicitly note ambiguities, conflicts, or multiple candidates and how they differ (song vs album; magazine vs manga; city vs council; studio vs compilation; standard vs extended measurements).
- If the question is a comparison (older/first/which), list the deciding property for each entity side-by-side (e.g., first issue year for both magazines; birth years for both people).
- If a key entity implied by the question is missing (e.g., second item in a comparison; the pivot entity like an author or owner; the specific cast/role), add a bullet:
  - Missing evidence: <state the missing entity/page/property we still need>.
- If the question implies constraints (timeframe, franchise/continuity, jurisdiction, revived/relaunched title, award category), add a bullet:
  - Question constraints: <state constraints verbatim>.
- Add an “Entity map” bullet to link roles across entities when helpful (work -> author; actor -> role; team -> league; suburb -> council -> establishment date).
- End with two lines:
  - Known entities: <exact titles/names identified so far (include medium/type) and authoritative variants (Editors (band), "Peak FM (North Derbyshire)", OK! (magazine) / OK magazine)>
  - Next-hop focus: <precise gap to look up next (e.g., "publication years for both The Fader and OK! (magazine)", "Henk ten Cate previous club before June 2008 (managerial timeline)", "is 'The Other Side' the Aerosmith or Pendulum song (work page)", "owner of Peak FM and owner HQ city", "formation year of Editors (band)")>
- Do not speculate or add unrelated background. If no relevant facts exist in the passages, state Missing evidence accordingly.


create_query_2_prompt = Use {question} and {summary_1} to generate a single improved search query.
- Target the exact gap in Next-hop focus using exact strings from Known entities; include disambiguators (medium/type qualifiers in parentheses), roles (author/cast/composer/librettist/coach), and jurisdictions/years from Question constraints.
- Include normalized OR-variants for ambiguous or punctuated names directly in the same query (e.g., '"OK! (magazine)" OR "OK magazine"').
- Calibrate breadth:
  - If Missing evidence indicates we lack the key pivot entity (author/publisher/owner/actor/artist), query directly for that entity and the property (e.g., '"A Death in Vienna" novel author'; '"Peak FM (North Derbyshire)" owner headquarters city').
  - If the first hop over-constrained (e.g., only promo/compilation surfaced), broaden with synonyms and list pages: discography/filmography/roster/“list of”/awards/career timeline.
  - For timeline questions, include prior-to/previous/before terms plus the year/season (e.g., '"Henk ten Cate" previous club before 2008 Panathinaikos').
- For comparisons, include all entities plus the property keywords (“first issue year”, “published first”, “older than”, “birth year”).
- For canonical properties (member count, founding/charter year, population 2010, address), query the exact entity page plus property keywords (e.g., '"Ivy League" number of schools', '"Florida Panthers" founded 1993 founding year').
- For lists or set membership, include “list of”, “complete list”, or the container entity (e.g., 'list of suburbs "City of Port Adelaide Enfield" north-western').
- Avoid special-case extensions unless asked (e.g., for river lengths prefer standard “length” rather than tributary-extended totals).
- Do not output code/SQL. Output exactly one plain-text query string only.
Guidelines (examples, do not output these):
- Next-hop: publication years for The Fader and OK! -> Query: '"The Fader" magazine first issue year OR founding year vs "OK! (magazine)" OR "OK magazine" first issue year'
- Next-hop: identify '71 actor and Peaky Blinders role -> Query: '"\'71" (film) cast actor "Arthur Shelby" "Peaky Blinders" Paul Anderson'
- Next-hop: Red Hot Chili Peppers 1989 album on EMI (studio vs compilation) -> Query: '"Red Hot Chili Peppers" discography 1989 album "Mother\'s Milk" EMI studio album'
- Next-hop: owner of Peak FM and owner HQ city -> Query: '"Peak FM (North Derbyshire)" owner "Wireless Group" headquarters city Belfast'
- Next-hop: who was born first -> Query: '"Alex Segal" birth year OR born vs "Mika Kaurismäki" birth year born 1955'


summarize_2_prompt = Read {passages_2} in light of {question} and {summary_1}.
- Compile final, question-focused bullets with exact names/dates/titles/numbers. Prefix each bullet with the source page title in brackets and indicate authority (exact entity/list/overview/unrelated).
- Resolve ambiguity by preferring the most specific, authoritative passage (exact entity page, official list/roster/award page, or the page for the property itself). Note conflicts briefly and state which source is most direct/specific and why (e.g., main entity page for standard length; film page for cast/role; award page for recipients).
- Ensure all constraints from Question constraints are satisfied (timeframe during a player’s tenure, correct medium/franchise, revived/relaunched year, chartered year, correct jurisdiction/council).
- For comparisons, explicitly compute which satisfies the comparison and state the deciding facts side-by-side (e.g., first issue year 1867 vs 1994 → Harper’s Bazaar first; birth years 1926 vs 1955 → 1926 is older). Add a “Computation” bullet showing the decision.
- For category/slot-fill questions (e.g., “are a breed of what?”, “John Smith ________ in Tadcaster”), include both the general class (dog, brewery) and any specific subgroup (working dog) if present; prefer the most general class when the question asks “what kind of/what”.
- For location “where” questions, return the place (city/region/country), not distances or directions.
- If the first hop lacked a key entity and it still does not appear, add a bullet:
  - Remaining gap: <state missing entity/property>.
- For measurements with multiple reported values, prefer the standard/conventional value from the canonical entity page unless the question specifies an alternative (e.g., do not include tributary-extended river lengths unless asked).
- End with one line: Answer candidate(s): <verbatim minimal candidate value(s) complying with constraints>. If truly unresolved, include “Unknown”.
- Produce only the summary.


final_answer_prompt = Given {question}, {summary_1}, {summary_2}, output the minimal, precise answer only.
- Determine the answer type (yes/no, person, date, number, title, place, list, category/comparison) and extract it verbatim from the summaries. For comparisons, output the single winning entity/value computed in summarize_2.
- Output an answer only if the exact value appears explicitly in the summaries (or is a direct comparison between explicit values) and matches all constraints (timeframe, medium/franchise, location, chartered year). Do not use outside knowledge.
- If summaries flag Missing evidence/Remaining gap, or if one side of a required comparison is absent, or if the value is not explicitly supported, output "Unknown".
- When multiple candidates exist but one is best supported by the most authoritative, most specific source (per summarize_2), choose that one; if multiple equal candidates remain for a singular “which/what” question, output "Unknown".
- For names or titles that include nicknames/aliases in the selected passage, include them as shown (e.g., Alan "Sparrow" Taylor).
- For fill-in-the-blank/type questions (“John Smith ________ in Tadcaster”), return the class/type (e.g., brewery) rather than a full proper name if that best fills the blank and is explicitly stated in the summaries.
- For “how many” questions, output the numeral only. For comparisons (“which published first/older than”), output the single winning entity/value only.
- Do not include any extra words, punctuation, or explanations.

llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.34it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:32<00:00 |  9.33it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.72it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:32<00:00 |  9.34it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.82it/s

f1: 0.5163326036392959, em: 0.42, prec: 0.5122939421689422, recall: 0.5511450216450218

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.25it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.82it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.36it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:22<00:00 |  6.64it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.49it/s

f1: 0.6146190476190476, em: 0.54, prec: 0.6160331890331892, recall: 0.6360555555555556

llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:54<00:02 |  2.30s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

llm_generate |██████████| 150/150 (100.0%) | ⏳ 01:00<00:00 |  5.90s/it

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 1/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 2/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 3/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 4/4: Optimized
************* start prompts *************
************* start prompts *************
************* start prompts *************
************* start prompts *************

create_query_1_prompt = You are generating a single high-recall, keyword-style search query for a BM25 retriever over Wikipedia abstracts. Use {question}.
- Objective: write one entity-rich query that surfaces the core entities AND the likely pivot pages (exact entity page; list/filmography/discography/roster/awards pages; cast/role pages; formation/founding/establishment pages; acquisition/ownership pages) needed to answer the question.
- First, identify: (a) answer type (person/date/year/number/title/place/list/yes-no/category/comparison), (b) all key entities and relations (author/screenwriter; composer/librettist; member-of; cast/role; birthplace/nationality vs country; award/award category; founding/formation year vs first season; publication/release year; owner/acquirer; list/roster membership), (c) constraints (timeframe/year; medium/type such as film/novel/album/song/TV series/magazine/board game/video game/city/league; jurisdiction/country/state/council; birth/death date filters; “more than/older than/first/which supports more” comparisons).
- Disambiguate aggressively with exact strings and type words:
  - Put exact titles/names in quotes; add medium/type and year where canonical: e.g., "Tombstone" (1993 film), "OK! (magazine)".
  - For ambiguous titles, include medium + year + creator/artist OR add multiple plausible variants joined by OR in one query: e.g., '"The Other Side" song OR single OR track'.
  - For people with common names, add role/affiliation and one notable work in quotes (actor; footballer; DJ; astronaut; composer).
  - Include normalized OR-variants for punctuated/named entities (OK! (magazine) OR "OK magazine"; "Ain't" OR "Aint").
- Recall-first multi-hop strategy in one query:
  - If the answer depends on a property of a second entity, include both the pivot entity and the property, and target overview/list pages: filmography/discography/roster, cast pages, award pages, founding/formation pages, acquisition/ownership pages.
  - For role-bridging questions, include: '"X" (year/medium) cast OR actors' plus filmography keywords (filmography, appeared in) and property synonyms (e.g., extraterrestrial OR alien OR UFO; owner OR acquired by; number of players).
  - For comparisons, include all entities plus the deciding property keywords (“first issue year”, “published first”, “older than”, “birth year”, “number of players”, “player count”).
  - For formation/founding, include the entity page + property keywords: founded OR formed OR established OR "first season" OR "joined [league]".
  - For screenwriter/composer/librettist, include entity and role keywords: writer OR screenwriter OR "written by"; composer OR "music by"; librettist.
  - For list/set membership (suburbs within a council; games that support more than two players), include “list of” or “roster” or the container entity (league/council/city) when relevant.
  - For nationality vs country, include “nationality” OR “born” OR “born in” and the person’s exact name.
- Precision controls for BM25:
  - Prefer exact entity pages and specific property keywords (“cast”, “screenwriter”, “player count”, “first issue”, “founded”, “owner”, “acquired by”, “platforms”, “release”).
  - Avoid injecting unrelated brand/manufacturer/platform family names unless they are the property being asked (e.g., don’t add “Sony Interactive Entertainment” unless ownership/brand is the focus).
  - Prefer property words over vague comparison words alone.
- Geographic and naming disambiguation:
  - If a place or team name could refer to multiple things, include qualifiers (city/state/county/league) and nearby anchors; for sports teams include “NHL”/“NBA”/“MLS”/“Premier League” as appropriate.
- Do not output code or explanations. Output exactly one plain-text query string only (no trailing punctuation beyond quotes).

Guidelines (examples, do not output these):
- Q: Which was published first, The Fader or OK!? -> Query: '"The Fader" magazine first issue year OR founding year vs "OK! (magazine)" OR "OK magazine" first issue year'
- Q: Which actor from Tombstone was born on May 17, 1955? -> Query: '"Tombstone" (1993 film) cast actor born "May 17, 1955"'
- Q: Which game can be played by more than 2 players at a time, Game of the Generals or Ludo? -> Query: '"Game of the Generals" number of players OR "player count" vs "Ludo" board game players OR "player count"'
- Q: What extraterrestrial creature is featured in a film in which the "Happy New Year" actor also appears? -> Query: '"Happy New Year" (2014 film) cast actor filmography extraterrestrial OR alien creature'
- Q: The Other Side was released by the band founded in what year? -> Query: '"The Other Side" song OR single artist band founding year OR formation year discography'

summarize_1_prompt = Read {passages_1} and extract only facts that help answer {question}.
- Write concise bullets as subject–verb–object triples. Prefix each bullet with the source page title in brackets. Include medium/type and years where relevant (e.g., [The Other Side (1927 album)] is a 1990 album by 1927 (band)).
- Preserve exact titles, names, dates, numbers; include aliases if present (birth/original names, nicknames). Include punctuated and normalized forms where relevant (OK! (magazine) / OK magazine).
- After each bullet, mark source authority in parentheses: (exact entity page / official list/award/roster/cast/property page / overview page / unrelated).
- Explicitly note ambiguities, conflicts, multiple candidates, or homonymous titles (song vs album; magazine vs manga; 1983 film vs 2009 film) and how they differ (year, medium, country, creator).
- For comparison questions (older/first/which), list the deciding property for each entity side-by-side with the entity names.
- If a key entity implied by the question is missing (second item in a comparison; the pivot entity like an author, owner, screenwriter/composer/librettist; a specific cast/role; the team’s name), add:
  - Missing evidence: <state the missing entity/page/property we still need>.
- If the question implies constraints (timeframe, franchise/continuity, jurisdiction/council/county, revived/relaunched title, canonical vs first season dates, category vs subtype, “more than 2 players”), add:
  - Question constraints: <state constraints verbatim>.
- Add an “Entity map” bullet to link roles across entities when helpful (work -> author/screenwriter/composer/librettist; actor -> role; team -> league; suburb -> council; game -> player count; person -> nationality/birthplace).
- Prefer canonical/standard values, but record all distinct values with their sources if conflicts appear.
- End with two lines:
  - Known entities: <exact titles/names identified so far (include medium/type) and authoritative variants>
  - Next-hop focus: <precise gap to look up next (e.g., "publication years for The Fader and OK! (magazine)", "identify actor from Tombstone born May 17, 1955", "player counts for Game of the Generals and Ludo", "screenwriter of Dirty Dancing (1987 film)", "owner and HQ city of Peak FM (North Derbyshire)", "traditional length of River Thames (miles)")>
- Do not speculate or add unrelated background; mark any non-pertinent passage as (unrelated). If no relevant facts exist in the passages, state Missing evidence accordingly.

create_query_2_prompt = Use {question} and {summary_1} to generate a single improved search query.
- Target the exact gap in Next-hop focus using exact strings from Known entities; include disambiguators (medium/type qualifiers in parentheses), roles (author/screenwriter/cast/composer/librettist/coach/owner), and jurisdictions/years from Question constraints.
- Include normalized OR-variants for ambiguous or punctuated names directly in the same query (e.g., '"OK! (magazine)" OR "OK magazine"').
- Calibrate breadth:
  - If Missing evidence indicates we lack the key pivot entity (author/publisher/owner/actor/artist/composer/librettist/screenwriter/team name), query directly for that entity and the property (e.g., '"A Death in Vienna" novel author'; '"Dirty Dancing" (1987 film) writer OR screenwriter "written by"').
  - If the first hop over-constrained or hit the wrong homonym, broaden or correct with year/medium/country and synonyms, and include multiple plausible variants joined by OR.
  - For comparisons, include all entities plus the deciding property keywords (“first issue year”, “published first”, “older than”, “birth year”, “number of players”, “player count”).
  - For lists or set membership (suburbs within a council; magazines someone was featured in), include “list of”, “complete list”, or the container entity (e.g., 'list of suburbs "City of Port Adelaide Enfield" north-western').
  - For acquisition/ownership history, include “acquired by” OR “purchased by” OR “owner” and the organization/work name; if a person was featured in unspecified magazines first, query '<Person> featured in OR appeared in magazine' to get the magazine, then acquisition.
  - For canonical properties (member count, founding/charter year vs first season, population 2010, traditional length; screenwriter of a film; composer/librettist of an opera; platforms of a game), query the exact entity page plus property keywords.
- Precision controls:
  - Prefer property words (“cast”, “screenwriter”, “first issue”, “founded”, “owner”, “player count”, “platforms”, “release year”, “birth”) over vague comparison words alone.
  - Avoid adding off-topic brand/manufacturer names unless they are central to the property being queried.
- Geographic/name-change handling: if Next-hop suggests alias or renamed entities (e.g., “The AXIS” now “Zappos Theater”), include both names with OR.
- Avoid special-case code/SQL. Output exactly one plain-text query string only.

Guidelines (examples, do not output these):
- Next-hop: player counts for Game of the Generals and Ludo -> Query: '"Game of the Generals" number of players OR "player count" vs "Ludo" board game players OR "player count"'
- Next-hop: identify '71 actor and Peaky Blinders role -> Query: '"\'71" (2014 film) cast actor "Arthur Shelby" "Peaky Blinders" Paul Anderson'
- Next-hop: actor from Tombstone born May 17, 1955 -> Query: '"Tombstone" (1993 film) cast actor born "May 17, 1955"'
- Next-hop: screenwriter of Dirty Dancing -> Query: '"Dirty Dancing" (1987 film) writer OR screenwriter "written by"'
- Next-hop: founding/first season year of Florida Panthers -> Query: '"Florida Panthers" founded year OR "first season" OR joined NHL 1993'
- Next-hop: platforms of Rise of Mana (other than PS Vita) -> Query: '"Rise of Mana" platforms OR released for iOS OR Android'

summarize_2_prompt = Read {passages_2} in light of {question} and {summary_1}.
- Compile final, question-focused bullets with exact names/dates/titles/numbers. Prefix each bullet with the source page title in brackets and indicate authority (exact entity/list/overview/unrelated/property).
- Resolve ambiguity by preferring the most specific, authoritative passage (exact entity page; official list/roster/cast/award page; property page; or the page for the property itself). Note conflicts briefly and state which source is most direct/specific and why (e.g., main entity page for standard length; film page for screenwriter/cast/role; award page for recipients; official roster/list for members).
- Ensure all constraints from Question constraints are satisfied (timeframe during a player’s tenure; correct medium/franchise; revived/relaunched year; chartered vs first season; correct jurisdiction/council/county; correct film year vs similarly named works).
- For comparisons, explicitly compute which satisfies the comparison and state the deciding facts side-by-side (e.g., player count: 2 vs 4; first issue year 1867 vs 1994 → Harper’s Bazaar first; birth years 1926 vs 1955 → 1926 is older). Add a “Computation” bullet showing the decision.
- For category/slot-fill questions (e.g., “are a breed of what?”, “what kind of work”), include the minimal explicit class stated in the source. Prefer the broader species/category (“Dogs”) over subtypes unless the question asks for a subtype.
- For location “where” questions, return the place (city/region/country), not distances or directions.
- If the first hop lacked a key entity and it still does not appear, add:
  - Remaining gap: <state missing entity/property>.
- For measurements or dates with multiple reported values, prefer the conventional/canonical value from the main entity page unless the question specifies otherwise. For sports teams “founded” with conflicting date vs year (e.g., exact award date vs first season), prefer the simple year on the main team page unless the question asks for the exact date.
- End with one line: Answer candidate(s): <verbatim minimal candidate value(s) complying with constraints>. If truly unresolved or one side of a required comparison is absent, include “Unknown”.
- Produce only the summary.

final_answer_prompt = Given {question}, {summary_1}, {summary_2}, output the minimal, precise answer only.
- Determine the answer type (yes/no, person, date, year, number, title, place, list, category/comparison, nationality vs country). Output exactly the value that matches the question.
  - If the question asks “what nationality,” return the demonym (e.g., Czech). If it asks “from which country,” return the country name (e.g., Czech Republic).
  - If the question asks “a breed of what?” return the broad species/category (e.g., Dogs), not subtypes (e.g., herding dog).
- Output an answer only if the exact value appears explicitly in the summaries (or is a direct comparison between explicit values) and matches all constraints (timeframe, medium/franchise, location/council/county, chartered vs first season, standard vs extended measurement).
- If summaries flag Missing evidence/Remaining gap, or if one side of a required comparison is absent, or if the value is not explicitly supported, output "Unknown".
- When multiple candidates exist, choose the one best supported by the most authoritative, most specific source (per summarize_2). If equal candidates remain for a singular “which/what” question, output "Unknown".
- Preserve names/titles exactly as shown in the selected passage (include nicknames/aliases if present, e.g., Alan "Sparrow" Taylor).
- For “how many” questions, output the numeral only. For comparisons (“which published first/older than/which supports more players”), output the single winning entity/value only.
- For sports teams “founded,” if both a precise charter/award date and a first season year are present, prefer the conventional year from the main team page unless the question requests the exact date.
- Do not include any extra words, punctuation, or explanations.

************* end prompts *************
************* end prompts *************
************* end prompts *************
************* end prompts *************

llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.62it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:33<00:00 |  9.02it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:30<00:00 |  9.79it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:31<00:00 |  9.40it/s

llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:26<00:00 |  5.90s/it 
llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:26<00:00 |  5.90s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

f1: 0.5420433112884726, em: 0.4533333333333333, prec: 0.5433691678691679, recall: 0.5714175084175086

llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:29<00:00 |  5.00s/it
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:53<00:00 |  5.62it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.08it/s
llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:39<00:00 |  1.51it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.46it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:21<00:00 |  6.83it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.71it/s

f1: 0.642608309990663, em: 0.5666666666666667, prec: 0.6423174603174603, recall: 0.6632777777777777


llm_generate |██████████| 150/150 (100.0%) | ⏳ 02:30<00:00 |  4.98s/it 
llm_generate |██████████| 150/150 (100.0%) | ⏳ 02:30<00:00 |  4.98s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 1/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 2/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 3/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 4/4: Optimized
create_query_1_prompt = You are generating a single high-recall, keyword-style search query for a BM25 retriever over Wikipedia abstracts. Use {question}.
- Objective: write one entity-rich query that surfaces (1) all core entities and aliases, (2) the exact pivot pages (persons/films/albums/TV series; list/filmography/discography/roster/awards pages; ownership/parent company/founding/HQ pages), and (3) property keywords needed to answer the question.
- From the question, explicitly extract:
  (a) answer type (person/date/year/number/title/place/list/category/yes-no/comparison),
  (b) all key entities with disambiguators (medium, year, country, league, club/organization),
  (c) relations (cast/role/author/screenwriter/composer/librettist/owner/parent company/founding/founded by/head coach OR manager/HQ city/first issue/publication year/birth year/“revived” OR “relaunched” OR “sequel”/player count/platforms OR systems OR devices),
  (d) constraints (timeframe/season/year; jurisdiction/city/county/state/country; renamed/alias; cross-entity bridging like “the singer of X”, “the actor from Y also appears in Z”; exclusion patterns like “other than/besides/except X” to guide later filtering).
- Disambiguation and aliases:
  - Add medium and year where canonical: "Tombstone" (1993 film), "OK! (magazine)", '"71" (2014 film)'.
  - Include creator/league/club when helpful: 'Panathinaikos F.C. head coach OR manager 2008 OR 2009'.
  - Include renamed venues/entities with OR: 'The AXIS OR "Zappos Theater"'.
  - Include normalized punctuation/diacritics variants with OR: "OK! (magazine)" OR "OK magazine"; "Ain't" OR "Aint"; "Koi... Mil Gaya" OR "Koi Mil Gaya".
- Recall-first multi-hop strategy (compose one rich query):
  - If the question references an entity by role (“singer of X”, “author of Y”, “team that drafted Z”), include both the work and the role: '"Work Hard, Play Harder" singer OR artist' with deciding property words (debut/year) if needed.
  - For comparisons (“which published first/older than/more players”), include all entities plus decisive property keywords (“first issue year”, “publication year”, “player count”, “birth year”).
  - For ownership/HQ questions, include the organization + 'owner OR parent company OR headquarters OR HQ city OR based in'.
  - For coaching/manager changes, include club + 'head coach OR manager' + season/year window.
  - For revival/relaunch, include 'revived OR relaunch OR revival' + year + title names.
  - For actor-bridging into themed works, include cast/filmography + theme terms (alien OR extraterrestrial; vampire; robot).
- Precision controls for BM25:
  - Prefer exact entity pages + property words (“cast”, “writer/screenwriter”, “composer”, “librettist”, “owner”, “parent company”, “founded”, “first issue”, “publication year”, “birth”, “head coach OR manager”, “county”, “headquarters”, “player count”, “platforms” OR “systems” OR “devices”, “release year”).
  - Add list/filmography/discography/roster/awards pages when they are the most likely source of the property.
  - Avoid off-topic brands/platforms unless explicitly relevant.
- Geographic/name disambiguation:
  - If a place/team/council is ambiguous, add qualifiers (city/state/county/league/country) and nearby anchors.
- Temporal scope:
  - If the question asks “how many has team X won”, target all-time totals unless a timeframe is specified (then include the dates).
- Output exactly one plain-text query string only (no trailing punctuation beyond quotes). No code or explanations.

Guidelines (examples, do not output these):
- Q: Which was published first, The Fader or OK!? -> Query: '"The Fader" magazine first issue year OR founding year vs "OK! (magazine)" OR "OK magazine" first issue year'
- Q: Which actor from Tombstone was born on May 17, 1955? -> Query: '"Tombstone" (1993 film) cast actor born "May 17, 1955"'
- Q: The Other Side was released by the band founded in what year? -> Query: '"The Other Side" song OR single artist band founding year OR formation year discography'
- Q: Where did Panathinaikos F.C.'s new coach work before? -> Query: '"Panathinaikos F.C." head coach OR manager appointed OR new coach 2008 OR 2009 previous club OR worked before'
- Q: Which DC title revived in 2007 features Donna Troy? -> Query: '"The Brave and the Bold" (2007 comic) revived OR relaunch characters OR features Donna Troy'
- Q: The Japanese action role-playing game Rise of Mana was developed for PlayStation Vita and what other devices? -> Query: '"Rise of Mana" video game platforms OR systems OR devices "released for" OR "available on" iOS OR Android'


summarize_1_prompt = Read {passages_1} and extract only facts that help answer {question}.
- Produce concise bullets as subject–verb–object triples. Prefix each bullet with the source page title in brackets. Include medium/type and years where relevant.
- Preserve exact titles, names, dates, numbers; include aliases if present (original names, nicknames, former names/renames).
- Identify and record the entity type for each named item (film, TV series, album, song, comic, magazine, opera, video game, person, team/club, council, company, venue). Use this to resolve homonyms.
- Note temporal scope explicitly: whether a fact is all-time, by season/year, or at a point in time.
- Mark authority after each bullet: (exact entity page / official list/award/roster/cast/property page / overview page / unrelated).
- Explicitly note ambiguities, conflicts, multiple candidates, or homonymous titles (song vs album; magazine vs manga; 1983 vs 2009 film) and how they differ (year, medium, country).
- For comparison/ordering questions, list the deciding property for each entity side-by-side.
- If a key pivot or property is missing (e.g., second item in a comparison; the person/owner/publisher/author/composer/librettist/actor; HQ city; coach/manager; revived title), add:
  - Missing evidence: <state the missing entity/page/property we still need>.
- If the question implies constraints (timeframe/season, jurisdiction/city/county/state/country, revived/relaunched year, alias/rename, exclusion patterns like “other than/besides/except X”), add:
  - Question constraints: <state constraints verbatim and list any explicit exclusions to be filtered from the final answer>.
- Add an “Entity map” bullet to link roles across entities when helpful (work -> author/screenwriter/composer/librettist; song -> artist; actor -> role; team -> league; owner -> HQ city; venue -> hotel/casino/parent company; person -> occupation; club -> head coach/manager).
- Prefer explicit phrases from sources over labels when the question asks “believed in what/what kind of/are a breed of what?” (quote the defining phrase if present).
- If a descriptive modifier in the question appears unverified or possibly inaccurate, still record the decisive fact and flag the descriptor as unverified.
- End with two lines:
  - Known entities: <exact titles/names identified so far (include medium/type) and authoritative variants>
  - Next-hop focus: <precise gap to look up next (e.g., "publication years for The Fader and OK! (magazine)", "identify actor from Tombstone born May 17, 1955", "Panathinaikos F.C. current/new head coach name + previous club", "which HNY (2014 film) cast member appears in a film featuring an extraterrestrial creature + name of that creature", "owner and HQ city of Peak FM’s parent (Wireless Group)", "revived DC titles in 2007 including The Brave and the Bold (2007 comic)")>
- Do not speculate or add unrelated background; mark any non-pertinent passage as (unrelated). If no relevant facts exist in the passages, state Missing evidence accordingly.


create_query_2_prompt = Use {question} and {summary_1} to generate a single improved search query.
- Target exactly the gap in Next-hop focus using exact strings from Known entities; include disambiguators (medium/type qualifiers in parentheses), roles (author/screenwriter/cast/composer/librettist/coach/head coach/manager/owner), organizations (parent company/headquarters), and jurisdictions/years from Question constraints.
- Include normalized OR-variants for ambiguous or punctuated names directly in the same query (e.g., '"OK! (magazine)" OR "OK magazine"').
- Calibrate breadth and correct homonyms:
  - If Missing evidence indicates a key pivot entity is absent (author/publisher/owner/parent company/actor/artist/composer/librettist/screenwriter/team name/coach OR manager/HQ city), query directly for that entity and the property (e.g., '"Panathinaikos F.C." head coach OR manager appointed 2008 OR 2009 previous club').
  - If the first hop over-constrained or hit the wrong homonym, broaden or correct with year/medium/country and synonyms; include multiple plausible variants joined by OR (e.g., '"The AXIS" OR "Zappos Theater" location hotel casino owner OR parent company').
  - For comparisons, include all entities plus deciding property keywords (“first issue year”, “published first”, “older than”, “birth year”, “number of players”, “player count”, “composer AND librettist”).
  - For revival/relaunch questions, include 'revived OR relaunch OR revival' + year + title names.
  - For actor-bridging to themed works, include the cast list entity and the theme keywords (e.g., 'alien OR extraterrestrial'), and, if known, likely crossover film titles with OR.
  - For ownership/HQ follow-ups, include owner name + 'headquarters OR based in OR HQ city'.
- Precision controls:
  - Prefer property words (“cast”, “writer/screenwriter”, “composer”, “librettist”, “first issue”, “founded”, “owner”, “parent company”, “headquarters”, “player count”, “platforms” OR “systems” OR “devices”, “platforms list”, “release year”, “birth”, “head coach OR manager”, “county”, “suburb”).
  - Include “list of” or filmography/discography/roster pages if that’s the likely authoritative source.
  - Avoid adding off-topic brands or platforms unless central to the property.
- Temporal scope:
  - If the question is about franchise totals (“how many World Series has the team won”), default to all-time totals unless a timeframe constraint is present; include “all-time” related list pages when helpful.
- Geographic/name-change handling: if Next-hop suggests alias or renamed entities (e.g., “The AXIS” now “Zappos Theater”), include both names with OR.
- Exclusion-awareness: for questions with “other than/besides/except/apart from,” you cannot enforce NOT in BM25; still include the pivot entity and property keywords so the exact entity page surfaces, and rely on summaries/final step to filter the excluded item.
- If Next-hop focus is fully satisfied by {summary_1}, generate a narrow confirmation query to the most authoritative exact entity or property/list page.
- Output exactly one plain-text query string only. No code or explanations.


summarize_2_prompt = Read {passages_2} in light of {question} and {summary_1}.
- Compile final, question-focused bullets with exact names/dates/titles/numbers. Prefix each bullet with the source page title in brackets and indicate authority (exact entity/list/overview/unrelated/property).
- Resolve ambiguity by preferring the most specific, authoritative passage (exact entity page; official list/roster/cast/award page; primary property page). Note conflicts briefly and state which source is most direct/specific and why.
- Ensure all constraints from Question constraints are satisfied (timeframe/season; correct medium/franchise; revived/relaunched year; jurisdiction/council/county; alias/rename handling; standard measurement vs extended variants). If the question uses “other than/besides/except/apart from,” explicitly filter out the excluded item(s) when listing candidates.
- For comparisons, explicitly compute which satisfies the comparison and state the deciding facts side-by-side. Add a “Computation” bullet with the decision.
- For category/slot-fill questions (“are a breed of what?”, “what kind of work”, “either X or what?”), return the minimal explicit class or missing alternative. Prefer the broader class unless a subtype is explicitly requested.
- For location “where” questions, return the place (city/region/country), not distances or directions.
- Prefer explicit defining phrases over labels when the question asks “believed in what/what kind of” (quote the source phrase if present).
- If a first-hop descriptor in the question was unverified, but the decisive identity/property is supported, state the decisive fact and note: Descriptor unverified/not required for answer.
- If a key entity/property remains absent, add:
  - Remaining gap: <state missing entity/property>.
- For totals versus time-bounded counts (e.g., championships), default to all-time totals unless the question limits the timeframe; cite list/record pages if used.
- End with one line: Answer candidate(s): <verbatim minimal candidate value(s) complying with constraints>. If unresolved or one side of a comparison is absent, include “Unknown”.
- Produce only the summary.


final_answer_prompt = Given {question}, {summary_1}, {summary_2}, output the minimal, precise answer only.
- Determine the answer type (yes/no, person, date, year, number, title, place, list, category/comparison/format, nationality vs country).
  - If the question asks “what nationality,” return the demonym (e.g., Czech). If it asks “from which country,” return the country name (e.g., Czech Republic).
  - If it asks “a breed of what?” or “what kind of,” return the broad class/category (e.g., Dogs; magazine; opera), not subtypes.
  - If the question has the pattern “either X or what?” or “X and what other Y?” or contains “other than/besides/except/apart from,” output only the alternative(s) to X and exclude any entity explicitly mentioned in the question.
  - For franchise totals (e.g., titles won), return all-time totals unless a timeframe is explicitly specified.
- Output an answer only if the value appears explicitly in the summaries or is a direct comparison/computation between explicit values and matches all constraints (timeframe, medium/franchise, location/council/county, standard measurement).
- Prefer explicit phrases from sources over labels when the question asks for beliefs or definitions; if both exist, output the explicit defining phrase.
- If summaries flag Missing evidence/Remaining gap, if one side of a required comparison is absent, or if the value is not explicitly supported, output "Unknown".
- When multiple candidates exist, choose the one best supported by the most authoritative, most specific source (per summarize_2). If equal candidates remain for a singular “which/what” question, output "Unknown".
- Preserve names/titles exactly as shown in the selected passage (include nicknames/aliases if present). Preserve numerals and punctuation as in source when possible.
- For “how many” questions, output the numeral only. For comparisons (“which published first/older than/which supports more players/which had more contributors”), output the single winning entity/value only.
- For sports teams “founded,” if both a precise date and a first season year are present, prefer the conventional year from the main team page unless the question requests the exact date.





























































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.61it/s


































































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:32<00:00 |  9.19it/s






































































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.72it/s




























































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:31<00:00 |  9.45it/s
































































































































llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.60it/s

f1: 0.48325742446144165, em: 0.37666666666666665, prec: 0.480700176865618, recall: 0.5227323232323234

llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:45<00:00 |  5.10s/it
llm_generate |██████████| 150/150 (100.0%) | ⏳ 11:06<00:00 |  4.45s/it
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.29it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:23<00:00 |  6.49it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.19it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.84it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.64it/s

f1: 0.5967907647907649, em: 0.5, prec: 0.5957293806411453, recall: 0.6321666666666667

llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:50<00:00 |  1.31it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:58<00:00 |  7.55s/it

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 1/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 2/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 3/4: Optimized

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark — a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
formatted_results = []
for doc in results[0]:
text = doc['text']
if " | " not in text:
return []
title, content = text.split(" | ", 1)
formatted_results.append({"title": title, "content": content})
return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static* — meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change — only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below is the baseline prompt currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompt to produce a **new, improved prompts** that:
- Retains all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produces clearer, more factually grounded reasoning and retrieval.
- Encourages entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remains faithful to the output schema and return format from the original prompt.
- Includes short, high-quality few-shot examples or guidelines if relevant.
- Return the prompts in the same formatting that they were given.
Note: Make sure to include the variables from the original prompt, which are wrapped in either single brackets or double brackets (e.g.
{var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:

   ✅ Batch 4/4: Optimized
NEW PROMPTS:

        create_query_1_prompt = You are generating a single high-recall, keyword-style search query for a BM25 retriever over Wikipedia abstracts. Use {question}.
- Objective: write one entity-complete query that surfaces (1) all core entities with exact Wikipedia page strings and plausible aliases/variants (add medium/type and year disambiguators), (2) the likely pivot pages needed to bridge hops (author/screenwriter/composer/librettist/owner/parent company/founding/HQ, film→cast/character, team→league/season/titles, station→owner), and (3) decisive property keywords required to answer the question (“birth”, “born”, “birthplace”, “nationality”, “first issue year”, “founded/founded by”, “headquarters/HQ”, “owner/parent company”, “cast/actor/role”, “World Series championships”, “debut year”, “platforms/systems/devices”).
- Before composing the query, internally determine: answer type (yes/no, person, date/year, number, title, place, list, comparison), key entities (with medium/year), relations (author/screenwriter/composer/librettist/owner/parent company/founded/founded by/HQ city/coach OR head coach OR manager/first issue/publication year/birth year/nationality/revival/character), and constraints (timeframe/season/year; jurisdiction/city/county/state/country; renames/aliases; exclusions like “other than/besides/except”).
- Disambiguation and aliases:
  - Add medium/year where canonical: '"Tombstone" (1993 film)', '"OK! (magazine)"', '"71" (2014 film)', '"The Brave and the Bold" (2007 comic)'.
  - Include creator/league/club when helpful: 'Panathinaikos F.C. head coach OR manager 2008 OR 2009'.
  - Include renamed entities with OR: '"The AXIS" OR "Zappos Theater"'.
  - Include punctuation/diacritic variants with OR: '"OK! (magazine)" OR "OK magazine"', '"Who\'s" OR "Whos"', '"Koi... Mil Gaya" OR "Koi Mil Gaya"'.
  - For ambiguous personal names, include role qualifiers with OR: '"Randy Jackson" musician OR producer OR "American Idol" OR bassist', and when known use parenthetical disambiguators (e.g., '"Randy Jackson (musician)"').
- Multi-hop recall strategy (compose one rich query):
  - For title→author/artist: include the work and author/artist pivot keywords: '"A Death in Vienna" novel author OR writer Daniel Silva'.
  - Bridge roles explicitly: if the question references a role (“singer of X”, “actor from Y born on [date]”), include the work + role + property words: '"Tombstone" (1993 film) cast actor born "May 17, 1955"'.
  - For team/person→titles: include person + played for OR team + championships keywords: '"Frank Bushey" baseball player played for team roster OR biography World Series championships OR titles'.
  - For comparisons/ordering (“which published first/older than/more players/formed same year”), include all entities plus deciding property keywords and country qualifiers if disambiguating.
  - For ownership/HQ/private–public: include the organization + 'owner OR parent company OR headquarters OR HQ city OR private OR public'.
  - For nationality/same-country yes/no, include both people + 'born OR birthplace OR birth place OR nationality OR citizen' and common demonyms (e.g., American, British).
  - Numbers-as-titles: include year/medium to resolve bare numerals: '"71" (2014 film) cast'.
- Precision controls for BM25:
  - Put exact entity page strings in quotes; add property words: “cast”, “actor”, “writer/screenwriter”, “composer”, “librettist”, “owner”, “parent company”, “founded”, “headquarters/HQ”, “first issue”, “birth”, “born”, “birthplace”, “nationality”, “head coach OR manager”, “formation year”, “player count”, “platforms OR systems OR devices”, “debut year”.
  - Add authoritative list/property pages when helpful: 'cast list', 'discography', 'filmography', 'list of [team] players', 'list of [team] seasons', 'World Series championships'.
  - Do not use site: filters or NOT/minus operators. Never include AND; use spaces and quoted phrases; use OR only as a literal token to include variants.
- Geographic/name disambiguation: If a place/team/council is ambiguous, add qualifiers (city/state/county/country) and nearby anchors.
- Temporal scope: For totals (“how many has team X won”), default to all-time unless a timeframe is specified; include “all-time” when helpful.
- Output exactly one plain-text query string only (quotes around exact titles, no trailing punctuation beyond quotes). No code or explanations.

Guidelines (examples, do not output these):
- Are Belinda Carlisle and Randy Jackson from the same country? -> '"Belinda Carlisle" born OR birthplace OR nationality American vs "Randy Jackson" musician OR producer born OR birthplace OR nationality'
- Which was published first, The Fader or OK!? -> '"The Fader" magazine "first issue" OR founding publication year US vs "OK! (magazine)" OR "OK magazine" "first issue" OR founding publication year UK'
- Which actor from Tombstone was born on May 17, 1955? -> '"Tombstone" (1993 film) cast actor born "May 17, 1955"'
- ’71 stars an actor who played what character in Peaky Blinders? -> '"71" (2014 film) cast "Peaky Blinders" (TV series) actor role character name'
- How many World Series has the team Frank Bushey played for won? -> '"Frank Bushey" baseball played for team roster OR biography World Series championships OR titles'
- Peak FM is owned and operated by a broadcasting company based in what city? -> '"Peak FM" radio station owner OR parent company "Wireless Group" headquarters OR HQ city Belfast'
- The singer of "Work Hard, Play Harder" made her debut when? -> '"Work Hard, Play Harder" song singer OR artist Gretchen Wilson debut year OR career began'


        summarize_1_prompt = Read {passages_1} and extract only facts that help answer {question}.
- Produce concise bullets as subject–verb–object triples. Prefix each bullet with the source page title in brackets. Include medium/type and years where relevant.
- Preserve exact titles, names, dates, numbers, punctuation; include aliases if present (original names, nicknames, former names/renames).
- Identify and record the entity type for each named item (film, TV series, album, song, comic, magazine, opera, video game, person, team/club, council, company, venue, university type private/public).
- Note temporal scope explicitly: all-time vs by season/year vs point-in-time; include units (mi/km, years, counts).
- Mark authority after each bullet: (exact entity page / official list/award/roster/cast/property page / overview page / unrelated).
- Explicitly record:
  - Pivots discovered (e.g., author/artist/singer; person/owner/parent company; composer/librettist; cast member; team played for; the second entity in a comparison).
  - Negative findings (e.g., “no HQ city stated”, “no singer stated for the song”, “no team identified for the player”).
- Entity consistency check for ambiguous titles:
  - If a title is ambiguous (e.g., “A Death in Vienna”), add a bullet clarifying the correct work+author before using properties.
- For comparison/ordering questions, list the deciding property for each entity side-by-side with values and dates.
- For cloze/relational prompts (“… southwest of ?”), extract both the subject’s location and the referenced anchor place and label which is the answer target.
- If a key pivot or property is missing (e.g., singer/author/owner/parent company/actor/HQ city/coach/private vs public/birthplace/nationality/team played for/World Series titles), add:
  - Missing evidence: <state the missing entity/page/property we still need>.
- If the question implies constraints (timeframe/season, jurisdiction/city/county/state/country, revived/relaunched year, alias/rename, numeric-title disambiguation, exclusions like “other than/besides/except”), add:
  - Question constraints: <state constraints verbatim and list any explicit exclusions to be filtered later>.
- Add:
  - Answer type: <person/date/year/number/title/place/list/category/yes-no/comparison>.
  - Entity map: <link roles across entities (work -> author/singer; film -> cast; person -> team; team -> championships; organization -> owner -> HQ city; person -> nationality/birthplace etc.)>.
- End with two lines:
  - Known entities: <exact titles/names identified so far (include medium/type) and authoritative variants>
  - Next-hop focus: <precise gap to look up next (e.g., "nationality or birthplace for Belinda Carlisle and Randy Jackson", "identify team for Frank Bushey then team World Series championships", "publication years for The Fader and OK! (magazine)", "identify actor from '71' and their Peaky Blinders character", "Wireless Group headquarters city")>
- Do not speculate or add unrelated background; mark any non-pertinent passage as (unrelated). If no relevant facts exist in the passages, state Missing evidence accordingly and make Next-hop focus target the exact entity pages needed.


        create_query_2_prompt = Use {question} and {summary_1} to generate a single improved search query.
- Target exactly the gap in Next-hop focus using exact strings from Known entities; include disambiguators (medium/type qualifiers in parentheses), roles (author/screenwriter/cast/composer/librettist/coach/head coach/manager/owner), organizations (parent company/headquarters), and jurisdictions/years from Question constraints.
- Include normalized OR-variants for ambiguous, misspelled, punctuated, or renamed names directly in the same query (e.g., '"OK! (magazine)" OR "OK magazine"', '"The AXIS" OR "Zappos Theater"').
- Calibrate breadth and fix homonyms:
  - If Missing evidence indicates a pivot entity is absent (author/singer/owner/parent company/actor/artist/composer/librettist/screenwriter/team/coach OR manager/HQ/private OR public/birthplace/nationality), query directly for that entity + the property (e.g., '"Belinda Carlisle" born OR birthplace OR nationality'; '"Frank Bushey" baseball played for team roster OR biography').
  - If the first hop over-constrained or picked the wrong homonym, broaden/correct with year/medium/country and synonyms; include multiple plausible variants joined by OR (e.g., '"Randy Jackson (musician)" OR "Randy Jackson" musician OR producer').
  - For comparisons, include all entities plus deciding property keywords (“first issue year”, “published first”, “older than”, “birth year”, “formation year”, “player count”, “World Series championships”).
  - For revival/relaunch questions, include 'revived OR relaunch OR revival' + year + title names and notable character names.
  - For ownership/HQ/private–public, include the entity name + 'owner OR parent company OR headquarters OR HQ city'.
- Precision controls:
  - Prefer property words: “cast”, “actor”, “writer/screenwriter”, “composer”, “librettist”, “first issue”, “founded”, “owner”, “parent company”, “headquarters/HQ”, “formation year”, “player count”, “platforms OR systems OR devices”, “release year”, “birth”, “born”, “birthplace”, “nationality”, “head coach OR manager”, “World Series championships OR titles”.
  - Include “list of” / filmography / discography / roster pages when authoritative; include “cast list” when targeting film actors; include “list of [team] seasons” or “World Series championships” for MLB titles.
  - Do not use site: filters or NOT/minus operators. Never include AND; use spaces and quoted phrases; use OR only for variants.
- Temporal scope:
  - For totals or histories (e.g., titles won), default to all-time unless limited; include “all-time” when helpful.
- Geographic/name-change handling: if Next-hop suggests alias or renamed entities, include both names with OR.
- Exclusion-awareness: for “other than/besides/except/apart from,” still include the pivot entity and property keywords; filtering happens in later steps.
- If Next-hop focus is fully satisfied by {summary_1}, generate a narrow confirmation query to the most authoritative exact entity or property/list page (e.g., 'cast list', 'infobox', 'publication history').
- Output exactly one plain-text query string only. No code or explanations.

Guidelines (examples, do not output these):
- “A Death in Vienna” novel author and count novels -> '"A Death in Vienna" novel author Daniel Silva bibliography list of novels total'
- Tombstone actor born May 17, 1955 -> '"Tombstone" (1993 film) cast list actor born "May 17, 1955" Bill Paxton birth date'
- The Fader vs OK! first issue -> '"The Fader" magazine "first issue" OR founding year vs "OK! (magazine)" OR "OK magazine" "first issue" OR founding year'
- Singer of "Work Hard, Play Harder" debut -> '"Work Hard, Play Harder" song singer OR artist Gretchen Wilson debut year OR career began'
- Same-country check for Belinda Carlisle and Randy Jackson -> '"Belinda Carlisle" born OR birthplace OR nationality vs "Randy Jackson" musician OR producer born OR birthplace OR nationality'
- Frank Bushey team then titles -> '"Frank Bushey" baseball played for team roster OR biography World Series championships OR titles "Boston Red Sox" OR "Chicago Cubs" (if discovered)'


        summarize_2_prompt = Read {passages_2} in light of {question} and {summary_1}.
- Compile final, question-focused bullets with exact names/dates/titles/numbers. Prefix each bullet with the source page title in brackets and indicate authority (exact entity/list/overview/unrelated/property).
- Resolve ambiguity by preferring the most specific, authoritative passage (exact entity page; official list/roster/cast/award/championship page; primary property page). Note conflicts briefly and state which source is most direct and why.
- Ensure all constraints from Question constraints are satisfied (timeframe/season; medium/franchise; revived/relaunched year; jurisdiction/council/county; alias/rename handling; numeric-title entity year/medium).
- For comparisons, list the deciding facts side-by-side and add a “Computation” bullet stating the decision; for yes/no “same country” comparisons, normalize demonyms to countries and compute.
- Prefer the specific named sub-event/entity over a broader umbrella event when both appear (e.g., choose “Northern Rock crisis” over “late-2000s financial crisis” if the scoop was specifically Northern Rock).
- For category/slot-fill questions, return the minimal explicit class or missing alternative. Prefer broader class unless a subtype is explicitly requested.
- For “where” questions, return the place (city/region/country) only; for cloze locations, return the referenced anchor place.
- If a descriptor in the question was unverified but the decisive identity/property is supported, state the decisive fact and note: Descriptor unverified/not required for answer.
- If a key entity/property remains absent, add:
  - Remaining gap: <state missing entity/property>.
- For totals vs time-bounded counts, default to all-time unless limited; cite list/record pages if used.
- End with one line: Answer candidate(s): <verbatim minimal candidate value(s) complying with constraints or “Unknown” if unresolved>.
- Produce only the summary.


        final_answer_prompt = Given {question}, {summary_1}, {summary_2}, output the minimal, precise answer only.
- Determine the answer type (yes/no, person, date, year, number, title, place, list, category/comparison, nationality vs country, address-only).
  - If “what nationality,” return the demonym (e.g., Czech). If “from which country,” return the country name (e.g., Czech Republic).
  - If “a breed of what?” or “what kind of,” return the broad class/category (e.g., Dogs; magazine; opera), not subtypes.
  - If the question pattern is “either X or what?” or “X and what other Y?” or contains “other than/besides/except/apart from,” output only the alternative(s) to X and exclude any entity explicitly mentioned.
  - For “where” questions, output the place only (city/region/country). For cloze location prompts (“… about 80 km southwest of ?”), output the referenced anchor place.
  - For people with nicknames: if the exact entity page presents a nickname tightly bound to the name (quotes in lead) and it appears in summaries, you may include it; otherwise use the full name.
  - For franchise totals (e.g., titles won), return all-time totals unless a timeframe is specified.
  - When a specific named sub-event/entity and a broader umbrella event both appear, choose the most specific one that directly satisfies the question (e.g., “Northern Rock crisis” for “which financial crisis”).
- Output an answer only if it appears explicitly in the summaries or is a direct comparison/computation from explicit values that match all constraints. Do not rely on external knowledge.
- If summaries flag Missing evidence/Remaining gap, if one side of a required comparison is absent, or if the value is not explicitly supported, output "Unknown".
- When multiple candidates exist, choose the one best supported by the most authoritative, most specific source (per summarize_2). If equal candidates remain for a singular “which/what” question, output "Unknown".
- Preserve names/titles exactly as shown in the selected passage (including punctuation/diacritics). Preserve numerals and units when possible.
- For “how many,” output the numeral only. For comparisons (“which published first/older than/which supports more players/which had more contributors”), output the single winning entity/value only.
- For sports teams “founded,” if both a franchise-award date and an establishment/first season year appear, prefer the conventional year used on the main team page/infobox when the question just says “founded”; otherwise honor the explicitly requested form.
- For yes/no questions, output only "Yes" or "No".
- For list answers, return a minimal comma-separated list without extra words.

llm_generate |██████████| 150/150 (100.0%) | ⏳ 08:24<00:00 |  3.37s/it
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.96it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:36<00:00 |  8.17it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.56it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:31<00:00 |  9.44it/s
llm_generate |███████   | 213/300 (71.0%) | ⏳ 00:19<00:07 | 11.33it/s

Exception in worker on attempt 1: raised APIConnectionError('Connection error.')
Requeuing...

llm_generate |█████████▉| 299/300 (99.7%) | ⏳ 00:26<00:00 | 13.03it/s

f1: 0.5242068887586792, em: 0.41, prec: 0.5213418354521295, recall: 0.5724848484848486