Question Answering Using Embeddings
Question answering using embeddings-based search
GPT excels at answering questions, but only on topics it remembers from its training data. What should you do if you want GPT to answer questions about unfamiliar topics? E.g.,
- Recent events after Sep 2021
- Your non-public documents
- Information from past conversations
- etc.
This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference text.
- Search: search your library of text for relevant text sections
- Ask: insert the retrieved text sections into a message to GPT and ask it the question
Why search is better than fine-tuning
GPT can learn knowledge in two ways:
- Via model weights (i.e., fine-tune the model on a training set)
- Via model inputs (i.e., insert the knowledge into an input message)
Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.
As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.
In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.
One downside of text search relative to fine-tuning is that each model is limited by a maximum amount of text it can read at once:
| Model | Maximum text length |
|---|---|
| gpt-3.5-turbo | 4,096 tokens (~5 pages) |
| gpt-4 | 8,192 tokens (~10 pages) |
| gpt-4-32k | 32,768 tokens (~40 pages) |
Continuing the analogy, you can think of the model like a student who can only look at a few pages of notes at a time, despite potentially having shelves of textbooks to draw upon.
Therefore, to build a system capable of drawing upon large quantities of text to answer questions, we recommend using a Search-Ask approach.
Search
Text can be searched in many ways. E.g.,
- Lexical-based search
- Graph-based search
- Embedding-based search
This example notebook uses embedding-based search. Embeddings are simple to implement and work especially well with questions, as questions often don't lexically overlap with their answers.
Consider embeddings-only search as a starting point for your own system. Better search systems might combine multiple search methods, along with features like popularity, recency, user history, redundancy with prior search results, click rate data, etc. Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded. Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
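The HyDE idea mentioned above can be sketched in a few lines. This is a Python illustration, not the notebook's code: the prompt wording and the `hyde_prompt` helper are assumptions, and in a real system the generated passage (not the raw question) would be sent to the embeddings API.

```python
def hyde_prompt(question: str) -> str:
    # Ask the model to write a plausible (possibly wrong) answer passage;
    # embedding that passage often matches reference text better than
    # embedding the question itself, since answers resemble answers.
    return (
        "Write a short passage that answers the question below, "
        "as it might appear in a reference document.\n\n"
        f"Question: {question}\nPassage:"
    )

# In a real system: send hyde_prompt(question) to a chat model,
# then embed the generated passage and search with that embedding.
print(hyde_prompt("Which athletes won the gold medal in curling in 2022?"))
```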
Full procedure
Specifically, this notebook demonstrates the following procedure:
- Prepare search data (once per document)
- Collect: We'll download a few hundred Wikipedia articles about the 2022 Olympics
- Chunk: Documents are split into short, mostly self-contained sections to be embedded
- Embed: Each section is embedded with the OpenAI API
- Store: Embeddings are saved (for large datasets, use a vector database)
- Search (once per query)
- Given a user question, generate an embedding for the query from the OpenAI API
- Using the embeddings, rank the text sections by relevance to the query
- Ask (once per query)
- Insert the question and the most relevant sections into a message to GPT
- Return GPT's answer
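The procedure above can be sketched end to end in miniature. The toy `embed` function below (a bag-of-words count vector) stands in for the OpenAI embeddings API, and the final Ask step is shown only as prompt construction; both are illustrative assumptions, not the notebook's implementation:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for the embeddings API: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Prepare: chunk documents and embed each section (done once).
sections = [
    "Curling at the 2022 Winter Olympics was held in Beijing.",
    "Figure skating medals were awarded in five events.",
]
index = [(s, embed(s)) for s in sections]

# 2. Search: embed the query and rank sections by similarity (per query).
query = "Who won curling at the 2022 Winter Olympics?"
q = embed(query)
ranked = sorted(index, key=lambda se: cosine(q, se[1]), reverse=True)

# 3. Ask: stuff the top section(s) into a message for the chat model (per query).
prompt = f"Use the sections below to answer the question.\n\n{ranked[0][0]}\n\nQuestion: {query}"
print(ranked[0][0])
```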
Costs
Because GPT is more expensive than embeddings search, a system with a decent volume of queries will have its costs dominated by step 3 (Ask).
- For gpt-3.5-turbo using ~1,000 tokens per query, it costs ~$0.002 per query, or ~500 queries per dollar (as of Apr 2023)
- For gpt-4, again assuming ~1,000 tokens per query, it costs ~$0.03 per query, or ~30 queries per dollar (as of Apr 2023)

Of course, exact costs will depend on the system specifics and usage patterns.
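The queries-per-dollar figures follow directly from the per-query costs; a quick check, using the prices as quoted in the text:

```python
# Per-query costs at ~1,000 tokens/query (Apr 2023 prices from the text above).
cost_per_query_gpt35 = 0.002  # gpt-3.5-turbo
cost_per_query_gpt4 = 0.03    # gpt-4

print(1 / cost_per_query_gpt35)  # queries per dollar for gpt-3.5-turbo
print(1 / cost_per_query_gpt4)   # queries per dollar for gpt-4 (~30)
```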
Preamble
We'll begin by:
- Importing the necessary libraries
- Selecting models for embeddings search and question answering
Installation
Install the Azure OpenAI SDK using the command below.
Run this cell; it will prompt you for the apiKey, endPoint, embeddingDeployment, and chatDeployment values.
Import namespaces and create an instance of OpenAIClient using the azureOpenAIEndpoint and the azureOpenAIKey.
Motivating example: GPT cannot answer questions about current events
Because the training data for gpt-3.5-turbo and gpt-4 mostly ends in September 2021, the models cannot answer questions about more recent events, such as the 2022 Winter Olympics.
For example, let's try asking 'Which athletes won the gold medal in curling in 2022?':
As an AI language model, I don't have real-time data. However, I can provide you with general information. The gold medalists in curling at the 2022 Winter Olympics will be determined during the event. The winners will be the team that finishes in first place in the men's and women's curling competitions. To find out the specific winners, you can check the official website of the International Olympic Committee or reliable sports news sources.
You can give GPT knowledge about a topic by inserting it into an input message
To help give the model knowledge of curling at the 2022 Winter Olympics, we can copy and paste the top half of a relevant Wikipedia article into our message:
The athletes who won the gold medal in curling at the 2022 Winter Olympics are as follows:

Men's Curling: Sweden (Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, Daniel Magnusson)

Women's Curling: Great Britain (Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, Mili Smith)

Mixed Doubles Curling: Italy (Stefania Constantini, Amos Mosaner)
Thanks to the Wikipedia article included in the input message, GPT answers correctly.
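Pasting reference text into the input amounts to building a message like the following. This Python sketch shows the shape of the request only; the system/user wording is illustrative, and `wikipedia_text` is a placeholder for the pasted article:

```python
wikipedia_text = "..."  # placeholder: top half of the Wikipedia article on curling at the 2022 Olympics

def build_messages(article: str, question: str) -> list:
    # The article is placed in the user message alongside the question,
    # so the model can answer from the supplied text rather than its weights.
    return [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": (
            f"Use the article below to answer the question.\n\n"
            f"Article:\n{article}\n\nQuestion: {question}"
        )},
    ]

messages = build_messages(wikipedia_text, "Which athletes won the gold medal in curling in 2022?")
```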
In this particular case, GPT was intelligent enough to realize that the original question was underspecified, as there were three curling gold medal events, not just one.
Of course, this example partly relied on human intelligence. We knew the question was about curling, so we inserted a Wikipedia article on curling.
The rest of this notebook shows how to automate this knowledge insertion with embeddings-based search.
1. Prepare search data
To save you the time & expense, we've prepared a pre-embedded dataset of a few hundred Wikipedia articles about the 2022 Winter Olympics. To see how we constructed this dataset, or to modify it yourself, see Embedding Wikipedia articles for search
2. Search
Now we'll define a search function that:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
- The top N texts, ranked by relevance
- Their corresponding relevance scores
Let's define an asynchronous method named SearchAsync that takes a query, a collection of knowledge base entries, and an optional result count (defaulting to 5), and returns a collection of search results.
The method starts by making an asynchronous request to the Azure OpenAI service to generate an embedding for the query. The GetEmbeddingsAsync method of the client object is used to make this request; it takes an instance of EmbeddingsOptions as a parameter, which specifies the deployment of the embedding model and the text to be embedded (in this case, the query). The response is then processed to extract the query's embedding.
Next, the method calculates the similarity between the query's embedding and the embeddings of all knowledge base entries using the ScoreBySimilarityTo method, which computes the cosine similarity (a measure of similarity between two non-zero vectors) between the query's embedding and each entry's embedding. The CosineSimilarityComparer<float[]>(t => t) specifies how the cosine similarity is calculated.
The resulting scores are then ordered in descending order, filtered to include only scores greater than 0.8, and the top resultCount scores are selected. In other words, the method returns the top resultCount entries whose similarity to the query's embedding exceeds 0.8.
Finally, the method creates a new instance of SearchResult for each selected entry, associating each entry with its similarity score. These instances are returned as the search results.
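The ranking logic described above (cosine similarity, a 0.8 cutoff, at most `resultCount` results) can be sketched in Python. The threshold and result count mirror the description; the entry texts and embedding vectors are toy values chosen for illustration:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, entries, result_count=5, threshold=0.8):
    # entries: list of (text, embedding) pairs, like the knowledge base entries.
    scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in entries]
    # Keep only scores above the cutoff, best first, at most result_count.
    scored = [(t, s) for t, s in scored if s > threshold]
    scored.sort(key=lambda ts: ts[1], reverse=True)
    return scored[:result_count]

entries = [
    ("curling results", [1.0, 0.0, 0.1]),
    ("figure skating results", [0.0, 1.0, 0.0]),
]
results = search([1.0, 0.0, 0.0], entries)
```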
3. Ask
With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.
Below, we define a function AskAsync that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer
The AskAsync method starts by calling the SearchAsync method with the user's question and a dataset about the 2022 Winter Olympics (olympicsData). The SearchAsync method searches the dataset for relevant information and returns a list of search results.
Next, the method constructs a string articles that contains all the search results. Each search result is formatted as a section of a Wikipedia article. The search results are joined together with newline characters in between.
The method then constructs a userQuestion string that instructs the AI to use the articles to answer the question. If the answer cannot be found in the articles, the AI is instructed to respond with "I could not find an answer."
The userQuestion string is then used to create an instance of ChatCompletionsOptions, which specifies the parameters for a chat completion request to the OpenAI API. Its Messages property is set to a list containing a system message and a user message: the system message instructs the AI that it answers questions about the 2022 Winter Olympics, and the user message is the userQuestion string. The Temperature property is set to 0, so the AI generates deterministic responses; the MaxTokens property is set to 3500, which limits the length of the AI's response; and the DeploymentName property is set to chatDeployment, which specifies the deployment of the chat model.
The method then makes an asynchronous request to the OpenAI API to get chat completions. The GetChatCompletionsAsync method of the client object is used to make this request. The method takes the ChatCompletionsOptions instance as a parameter.
Finally, the method processes the response from the OpenAI API to extract the AI's answer. The Value.Choices.FirstOrDefault()?.Message?.Content expression is used to get the content of the first choice in the response. The method then returns this answer.
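The prompt assembly described above can be sketched as follows. The section-header format, the "I could not find an answer." instruction, temperature 0, and the 3500-token cap come from the description; the function name and other wording are illustrative assumptions:

```python
def build_ask_request(question, search_results):
    # Join the retrieved sections, each framed as a Wikipedia article excerpt.
    articles = "\n".join(
        f'Wikipedia article section:\n"""\n{text}\n"""' for text in search_results
    )
    user_question = (
        "Use the below articles on the 2022 Winter Olympics to answer the question. "
        'If the answer cannot be found in the articles, write "I could not find an answer."'
        f"\n\n{articles}\n\nQuestion: {question}"
    )
    return {
        "messages": [
            {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
            {"role": "user", "content": user_question},
        ],
        "temperature": 0,    # deterministic answers
        "max_tokens": 3500,  # cap on the length of the reply
    }

request = build_ask_request(
    "How many gold medals did Norway win?",
    ["Norway won 16 gold medals at the 2022 Winter Olympics."],
)
```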
The athletes from Norway won a total of 16 gold medals at the 2022 Winter Olympics.
The 2022 Winter Olympics took place in Beijing, China.