Embedding Wikipedia Articles For Search

This notebook shows how to generate embeddings for a large dataset. As a worked example, it walks through how we prepared a dataset of Wikipedia articles for search, which is used in Question_answering_using_embeddings.ipynb.

Installation

Install the Azure OpenAI SDK using the command below.

[1]
[2]

Run this cell. It will prompt you for the apiKey, the endPoint, and the embedding deployment name.

[3]

Import the namespaces and create an instance of OpenAIClient using the azureOpenAIEndpoint and the azureOpenAIKey.

[4]
[5]

1. Collect documents

In this example, we'll download a few hundred Wikipedia articles related to the 2022 Winter Olympics.

[6]
[7]
Total pages: 17
[8]

Next, we'll recursively split long sections into smaller sections. There's no perfect recipe for splitting text into sections. Some tradeoffs include:

  • Longer sections may be better for questions that require more context
  • Longer sections may be worse for retrieval, as they may have more topics muddled together
  • Shorter sections are better for reducing costs (which are proportional to the number of tokens)
  • Overlapping sections may help prevent answers from being cut by section boundaries
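To make the last tradeoff concrete, here is a minimal sketch of overlapping sections, written in Python purely for illustration (it is not code from this notebook). The function name `overlapping_windows` and its `size` and `overlap` parameters are hypothetical, and plain words stand in for model tokens.

```python
def overlapping_windows(tokens, size, overlap):
    """Split a token list into windows of at most `size` tokens.

    Each window starts with the last `overlap` tokens of the previous
    window, so text near a boundary appears in both neighbours.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input.
        if start + size >= len(tokens):
            break
    return windows


# Assumption: whitespace-separated words stand in for real tokens.
words = "the quick brown fox jumps over the lazy dog".split()
windows = overlapping_windows(words, size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

A larger `overlap` lowers the chance that an answer is cut in two, at the price of embedding more tokens overall.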

Here, we'll use a simple approach and limit sections to 1,600 tokens each, recursively halving any sections that are too long.
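The recursive halving idea can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the notebook's own implementation: the helper names `count_tokens`, `halve`, and `split_to_limit` are hypothetical, tokens are approximated by whitespace-separated words rather than a real tokenizer, and each split falls on the word boundary nearest the middle rather than on a paragraph or sentence delimiter.

```python
def count_tokens(text):
    # Assumption: word count approximates the model's token count.
    # A real pipeline would use the tokenizer of the embedding model.
    return len(text.split())


def halve(text):
    """Split text into two parts at the word boundary nearest the middle.

    Note: joining with single spaces normalizes the original whitespace.
    """
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def split_to_limit(text, max_tokens=1600):
    """Recursively halve `text` until every piece fits in `max_tokens`."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if count_tokens(text) <= max_tokens:
        return [text]
    left, right = halve(text)
    return split_to_limit(left, max_tokens) + split_to_limit(right, max_tokens)


# Small limit so the recursion is visible on a short input.
section = " ".join(f"w{i}" for i in range(10))
chunks = split_to_limit(section, max_tokens=3)
for chunk in chunks:
    print(count_tokens(chunk), chunk)
```

Splitting on word boundaries keeps the sketch short. Splitting on paragraph or sentence boundaries near the midpoint would usually give more coherent sections.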

[ ]
[10]
[11]
[12]
[13]