Structured Data Retreival Nltosql
Introduction
This notebook demonstrates retreiving information from SQL Databases using Azure OpenAI. The notebook is a part of the larger work on End-to-End virtual assisstant demo.
Background
We assume the role of a regional sales manager, who is interesed in learning more about Sales of different Surface products, price points etc. Information on sales is available across multiple tables in the SQL Database. In this notebook, we illustrate how to retrieve information available in SQL databases through natural language questioning. Specifically:
- We ask (sales related) questions in natural language
- Pass the question to Azure OpenAI as prompt. The completion we get in return is a SQL query to retrieve the answer
- We run this SQL query on our database to retrieve an answer
- Finally, we pass the question and (SQL retreived answer) to Azure OpenAI. The completion we get in return is the answer in natural language.
For information on retreiving information from unstructured data or for the integrated demo, please refer to other material within the bundle.
C:\Users\mejani\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\compat\_optional.py:138: UserWarning: Pandas requires version '2.7.0' or newer of 'numexpr' (version '2.6.9' currently installed). warnings.warn(msg, UserWarning)
First, we set up Azure OpenAI and SQL Server
Prompt Engineering
Let us look at the prompts to pass to Azure OpenAI for NL to SQL task. You find this prompt in generate_nl_to_sql().
System Prompt
The first step is preparing the system prompt. This sets up the overall tone for GPT to answer Sales question. It also sets informs the bot the kind of questions it is authorized to answer.Following are some characteristics of our system prompt:
-
We first generally describe the bots role (you are a SQL Programmer Assistant)
-
We provide some additional details on the role:
a. Specify the SQL version to be used
b. Insturctions for what to do if a valid query cannot be produced
c. Instructions on what kind of questions to not answer
d. Any other specific instruction.
-
Since we are solving an NL to SQL problem, we need to provide the schema of the Sales database next. We provide column names along with datatypes, and relationship information by declaring primary and foreign keys
-
Lastly, we provide some few shot examples. After experimentation on zero shot, few shot examples are added based on the types of queries we see most errors on. We make sure to pass these examples in the "user" and "assistant" role setup.
User Prompt
The next step is to prepare the user prompt. This is just the NL question to be asked. Make sure to follow some basic guidelines while phrasing questions:
- Be specific in what you are asking
- Make sure to phrase the question simply. Consider paraphrasing or breaking the question down into smaller segments in the question is involved.
Our next prompt is to convert the query retreived answer into better format. You find this prompt in generate_sql_to_nl()
System Prompt
Since this is an easier task, the prompt is relatively simpler.
- Set the overall role for the bot
- Provide detailed instructions on the task to be performed (convert to natural language. Convert table to html table)
- provide guidance on what not to do (do not return markdown format)
User Prompt
The user prompt for this setup is just passing the original natural language question along with the SQL retreived answer.
Examples
In the subsequent cells, we demonstrate the setup in action. We first try out a simple question. Next, we try out a more involved question, which looks at information from multiple tables. Lastly, we see a multi-turn question i.e a question which requires context from the question previously asked.
Handling Multi-turn questions in prompts
In some cases, the user breaks down the question in two multiple parts as the conversation continues. In this case, GPT would require context from the history to be able to comprehend and correctly answer the current question. As a result, we add this extra information to the prompt before sending our current question. Specifically,
- We first append our conversation history to the prompt. We keep the "user", "assistant" format we have been following so far with our prompt. Within "user", we add the qustion asked in "user", and the NL answer recieved in "assistant". Make sure to keep the order of Q&A as it occured in the conversation.
- Finally, we add our current question to the prompt, and pass this newly updated prompt to generate_nl_to_sql()
Example 1: Products - A simple NLtoSQL query
Our user input is the NL question, as an output, Azure OpenAI generates as completion the SQL query we should run on the Sales database to answer this question
SELECT prod_name, price FROM Products;
We run the SQL query on our database. We retrive the information as a pandas dataframe
prod_name price 0 Surface Pro 9 999.99 1 Surface Laptop 5 899.99 2 Surface Slim Pen 2 94.99 3 Surface Laptop SE 299.99 4 Surface Laptop Go 2 599.99 5 Surface Go 3 384.99 6 Surface Studio 2+ 1199.99 7 Surface Pen 99.99 8 Surface Duo 2 1099.99 9 Surface Dock 199.99
We ask OpenAI to convert this into a more "Chat-setting friendly" answer. We pass the question, and the answer to Azure OpenAI and ask it to convert the informtaion to an HTML table. In case of a non table asnwer, We ask Azure OpenAI to convert the answer to natural language format.
<table>
<tr>
<th>Product Name</th>
<th>Price</th>
</tr>
<tr>
<td>Surface Pro 9</td>
<td>$999.99</td>
</tr>
<tr>
<td>Surface Laptop 5</td>
<td>$899.99</td>
</tr>
<tr>
<td>Surface Slim Pen 2</td>
<td>$94.99</td>
</tr>
<tr>
<td>Surface Laptop SE</td>
<td>$299.99</td>
</tr>
<tr>
<td>Surface Laptop Go 2</td>
<td>$599.99</td>
</tr>
<tr>
<td>Surface Go 3</td>
<td>$384.99</td>
</tr>
<tr>
<td>Surface Studio 2+</td>
<td>$1199.99</td>
</tr>
<tr>
<td>Surface Pen</td>
<td>$99.99</td>
</tr>
<tr>
<td>Surface Duo 2</td>
<td>$1099.99</td>
</tr>
<tr>
<td>Surface Dock</td>
<td>$199.99</td>
</tr>
</table>
| Product Name | Price |
|---|---|
| Surface Pro 9 | $999.99 |
| Surface Laptop 5 | $899.99 |
| Surface Slim Pen 2 | $94.99 |
| Surface Laptop SE | $299.99 |
| Surface Laptop Go 2 | $599.99 |
| Surface Go 3 | $384.99 |
| Surface Studio 2+ | $1199.99 |
| Surface Pen | $99.99 |
| Surface Duo 2 | $1099.99 |
| Surface Dock | $199.99 |
Example 2: Sales - Example of Joins
SELECT SUM(Sales_Detail.quantity) AS total_laptops_sold FROM Sales_Detail JOIN Sales ON Sales_Detail.sales_id = Sales.sale_id JOIN Products ON Sales_Detail.prod_id = Products.prod_id WHERE Products.category = 'laptop' AND MONTH(Sales.date) = 12;
total_laptops_sold 0 534931
In December, a total of 534,931 laptops were sold.
Example 3: Stock - Example of Multi Turn questions
SELECT SUM(stock) as total_stock FROM Stock WHERE prod_id IN (SELECT prod_id FROM Products WHERE prod_name LIKE '%Surface Laptop 5%');
total_stock 0 2527
We are currently carrying a total stock of 2,527 units of Surface Laptop 5.
[{'role': 'user',
, 'content': 'How much stock of Surface Laptop 5 are we currently carrying?'},
, {'role': 'system',
, 'content': 'We are currently carrying a total stock of 2,527 units of Surface Laptop 5.'}] SELECT SUM(stock) as total_stock FROM Stock WHERE prod_id IN (SELECT prod_id FROM Products WHERE prod_name LIKE '%Surface Laptop 5%') AND merchant_id IN (SELECT merchant_id FROM Merchants WHERE merchant_name LIKE '%GadgetWorld%');
total_stock 0 195
There are 195 items in total stock at GadgetWorld.