Embedding Wikipedia Articles For Search
This notebook shows how to get embeddings for a large dataset. It walks through how we prepared a dataset of Wikipedia articles for search, which is used in Question_answering_using_embeddings.ipynb.
Installation
Install the Azure OpenAI SDK using the command below.
[1]
[2]
Run this cell. It will prompt you for the apiKey, endPoint, and embedding deployment.
[3]
Import namespaces and create an instance of OpenAIClient using azureOpenAIEndpoint and azureOpenAIKey.
[4]
[5]
1. Collect documents
In this example, we'll download a few hundred Wikipedia articles related to the 2022 Winter Olympics.
[6]
[7]
Total pages: 17
[8]
Next, we'll recursively split long sections into smaller sections. There's no perfect recipe for splitting text into sections. Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Overlapping sections may help prevent answers from being cut by section boundaries
Here, we'll use a simple approach and limit sections to 1,600 tokens each, recursively halving any sections that are too long.
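As a rough illustration of the recursive halving idea, here is a minimal C# sketch. It is not the notebook's actual cell. `CountTokens` and `SplitRecursively` are hypothetical names, and whitespace-separated words stand in for real tokens (an assumption; a real tokenizer will give different counts). The sketch also cuts at the midpoint word, whereas a production splitter would likely prefer a paragraph or sentence boundary near the middle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
static int CountTokens(string text) =>
    text.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries).Length;

// Returns the text unchanged if it fits within maxTokens; otherwise cuts it
// in half at the midpoint word and applies the same rule to each half.
static List<string> SplitRecursively(string text, int maxTokens = 1600)
{
    if (maxTokens < 1)
        throw new ArgumentOutOfRangeException(nameof(maxTokens));

    string[] words = text.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length <= maxTokens)
        return new List<string> { text };

    int mid = words.Length / 2;
    string left = string.Join(" ", words.Take(mid));
    string right = string.Join(" ", words.Skip(mid));

    List<string> pieces = SplitRecursively(left, maxTokens);
    pieces.AddRange(SplitRecursively(right, maxTokens));
    return pieces;
}

// Usage: a 5,000-word section is halved twice, giving four pieces of 1,250 words.
string sample = string.Join(" ", Enumerable.Repeat("word", 5000));
List<string> sections = SplitRecursively(sample, maxTokens: 1600);
Console.WriteLine(sections.Count); // prints 4
```

Halving keeps the pieces of a section roughly equal in size, but it does not add overlap between neighbors, so an answer that straddles a cut can still be split across two pieces.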
[ ]