Azure Embedding Wikipedia Articles For Search

Embedding Wikipedia articles for search

This notebook shows how to get embeddings for a large dataset. It walks through how we prepared a dataset of Wikipedia articles for search, which is then used in Question_answering_using_embeddings.ipynb.

Installation

Install the Azure OpenAI SDK using the command below.

[1]

[2]

Run this cell; it will prompt you for the apiKey, endPoint, and embedding deployment.

[3]

Import namespaces and create an instance of `OpenAIClient` using the `azureOpenAIEndpoint` and the `azureOpenAIKey`.

[4]

[5]

1. Collect documents

In this example, we'll download a few hundred Wikipedia articles related to the 2022 Winter Olympics.

[6]

[7]

Total pages: 17

[8]

Next, we'll recursively split long sections into smaller sections. There's no perfect recipe for splitting text into sections. Some tradeoffs include:

- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Overlapping sections may help prevent answers from being cut by section boundaries
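
To make the last tradeoff concrete, here is a minimal Python sketch of overlapping sections. It is an illustration only, not the notebook's own code (the notebook's cells are C# and are not reproduced here). The function name `overlapping_sections` and the `size` and `overlap` parameters are hypothetical, and whitespace-separated words stand in for real tokens.

```python
from typing import List


def overlapping_sections(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Slide a window of `size` words forward by `size - overlap` words at a time."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    sections = []
    for start in range(0, len(words), step):
        sections.append(words[start:start + size])
        if start + size >= len(words):
            # This window already reaches the end of the text.
            break
    return sections


# Assumption: words approximate tokens; a real pipeline would count model tokens.
words = "a b c d e f g h i j".split()
for section in overlapping_sections(words, size=4, overlap=2):
    print(" ".join(section))
```

Each section repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one section. The price is that the repeated words are embedded twice, which raises the token cost.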

Here, we'll use a simple approach and limit sections to 1,600 tokens each, recursively halving any sections that are too long.
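
The recursive halving idea can be sketched in a few lines of Python. This is a hedged illustration rather than the notebook's implementation (whose C# cells are not shown here). The names `count_tokens` and `split_recursively` are hypothetical, the split point is simply the middle word, and `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def split_recursively(
    text: str,
    max_tokens: int = 1600,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Halve `text` until every piece has at most `max_tokens` tokens."""
    if counter(text) <= max_tokens:
        return [text]
    words = text.split()
    if len(words) < 2:
        # A single unbroken word cannot be halved; return it instead of recursing forever.
        return [text]
    middle = len(words) // 2
    left = " ".join(words[:middle])
    right = " ".join(words[middle:])
    return (
        split_recursively(left, max_tokens, counter)
        + split_recursively(right, max_tokens, counter)
    )


# A 5,000-word text is halved twice, giving four sections of 1,250 words each.
sections = split_recursively("word " * 5000)
print(len(sections), max(count_tokens(s) for s in sections))  # prints "4 1250"
```

Splitting at the middle word keeps the sketch short. Splitting at the paragraph or sentence boundary nearest the midpoint would keep sections more readable, and passing a model-specific token counter as `counter` would make the 1,600 limit match what the embedding model actually sees.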

[ ]

[10]

[11]

[12]

[13]