Azure Labelling Github Issues With Embeddings


alph-notebooks/azure-openai-samples / Labelling_github_issues_with_embeddings.ipynb


Automatically labelling GitHub issues

[1]

[2]

[ ]

[4]

[6]

Access to GitHub

You will need an access token with rights to query and update issues.

[8]

[9]

[10]

Using a GitHub API token

[11]

[12]

The code below uses Octokit, a .NET client library for the GitHub API.

The first part of the code creates a RepositoryIssueRequest named last6Months, which specifies the parameters of a request to fetch issues from a GitHub repository. The Filter property is set to IssueFilter.All, so issues are returned regardless of how the authenticated user relates to them (assigned, created, mentioned, or subscribed). Note that Filter does not select by state: open versus closed is controlled by the separate State property, which in Octokit defaults to open issues only unless it is set to ItemStateFilter.All. The Since property is set to roughly six months before the current time, computed as 180 days (DateTimeOffset.UtcNow.Subtract(TimeSpan.FromDays(30*6))), so the request returns only issues updated at or after that point.

The second part of the code makes an asynchronous request to fetch the issues of a specific repository. It calls the GetAllForRepository method on the Issue property of gitHubClient, passing org and repoName to identify the organization and the repository, presumably together with the last6Months request so that its parameters apply. The method returns a read-only list of the matching issues. The await keyword suspends the cell until the request completes without blocking the thread, which is needed because the call is asynchronous and its result is not available immediately.
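As a rough sketch of the two steps described above, the request and the fetch might look like the following. It assumes the Octokit NuGet package is referenced. The org and repoName values, the GITHUB_TOKEN environment variable, and the "issue-labeller" product name are placeholders rather than values from the notebook, and setting State is an assumption made here so that closed issues are included as well.

```csharp
// Assumes the Octokit NuGet package is referenced (in a notebook: #r "nuget: Octokit").
using System;
using System.Collections.Generic;
using Octokit;

// Placeholder values, not taken from the notebook.
var org = "dotnet";
var repoName = "interactive";
var token = Environment.GetEnvironmentVariable("GITHUB_TOKEN");

var gitHubClient = new GitHubClient(new ProductHeaderValue("issue-labeller"));

// Request parameters: no restriction by user relationship, and only
// issues updated within roughly the last six months (180 days).
var last6Months = new RepositoryIssueRequest
{
    Filter = IssueFilter.All,
    // Assumption: include closed issues too (Octokit defaults to open only).
    State = ItemStateFilter.All,
    Since = DateTimeOffset.UtcNow.Subtract(TimeSpan.FromDays(30 * 6))
};

// Only call GitHub when a token is available, so the sketch is safe to run offline.
if (!string.IsNullOrEmpty(token))
{
    gitHubClient.Credentials = new Credentials(token);

    IReadOnlyList<Issue> allIssues =
        await gitHubClient.Issue.GetAllForRepository(org, repoName, last6Months);

    Console.WriteLine($"Fetched {allIssues.Count} issues updated since {last6Months.Since:yyyy-MM-dd}");
}
else
{
    Console.WriteLine("GITHUB_TOKEN is not set; skipping the request.");
}
```

Reading the token from an environment variable keeps it out of the notebook itself, which matters if the notebook is shared or committed.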

[13]

[14]

[15]

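The next cell is a foreach loop that iterates over chunks of issues. The Chunk(16) method splits the allIssues collection into arrays of up to 16 issues each. Batching in this way keeps each embeddings request small. The size of 16 is likely chosen to stay within the number of inputs the embeddings endpoint accepts per request, and it also bounds how much text is held in memory at once.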

Inside the loop, for each chunk of issues, the code first concatenates the title and body of each issue and truncates the resulting string to a maximum of 8191 tokens using the tokenizer.TruncateByTokenCount(s,8191) method. The resulting strings are then converted to an array.

Next, the code makes an asynchronous request to the Azure OpenAI service to generate an embedding for the text of each issue in the chunk. The request goes through the GetEmbeddingsAsync method of openAIClient, which takes an EmbeddingsOptions instance specifying the deployment name of the embedding model and the texts to embed.

The response from the AI service is then processed to extract the embeddings. The Value.Data property of the response contains the embeddings, which are converted to arrays and stored in the embeddings variable.

Finally, the code creates a new instance of IssueWithEmbedding for each issue in the chunk, associating each issue with its corresponding embedding. These instances are added to the issuesWithEmbeddings collection for further processing.
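The batching loop described above can be sketched without any service dependency. In this sketch the embeddings call is replaced by a delegate (embedBatch) and the tokenizer by a crude character cap, so IssueText, EmbeddedIssue, BatchEmbedder, and FakeEmbedBatchAsync are all hypothetical names used for illustration and are not the notebook's types. It requires .NET 6 or later for Chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Sample data standing in for the issues fetched from GitHub.
var allIssues = Enumerable.Range(1, 40)
    .Select(n => new IssueText($"Issue {n}", $"Body of issue {n}"))
    .ToList();

var batchCalls = 0;

// Hypothetical stand-in for the embeddings service: one tiny vector per input.
Task<float[][]> FakeEmbedBatchAsync(string[] inputs)
{
    batchCalls++;
    return Task.FromResult(inputs.Select(s => new float[] { s.Length, 1f }).ToArray());
}

var issuesWithEmbeddings = await BatchEmbedder.EmbedAllAsync(
    allIssues,
    batchSize: 16,
    // Crude character cap used here in place of a real token-based truncation.
    truncate: text => text.Length <= 2000 ? text : text[..2000],
    embedBatch: FakeEmbedBatchAsync);

Console.WriteLine($"{issuesWithEmbeddings.Count} issues embedded in {batchCalls} requests");

public record IssueText(string Title, string Body);
public record EmbeddedIssue(IssueText Issue, float[] Embedding);

public static class BatchEmbedder
{
    public static async Task<List<EmbeddedIssue>> EmbedAllAsync(
        IEnumerable<IssueText> issues,
        int batchSize,
        Func<string, string> truncate,
        Func<string[], Task<float[][]>> embedBatch)
    {
        var results = new List<EmbeddedIssue>();

        // Process the issues in batches of at most batchSize.
        foreach (IssueText[] chunk in issues.Chunk(batchSize))
        {
            // Concatenate title and body, then truncate each text.
            string[] texts = chunk
                .Select(i => truncate(i.Title + "\n" + i.Body))
                .ToArray();

            // One request per batch; the result is expected in input order.
            float[][] vectors = await embedBatch(texts);
            if (vectors.Length != chunk.Length)
                throw new InvalidOperationException("Expected one embedding per input.");

            // Pair each issue with the embedding at the same position.
            results.AddRange(chunk.Zip(vectors, (issue, vector) => new EmbeddedIssue(issue, vector)));
        }

        return results;
    }
}
```

With 40 sample issues and a batch size of 16, the loop makes three requests (16, 16, and 8 inputs). Passing the service call in as a delegate also makes the pairing logic easy to test without network access.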

[16]

Azure.RequestFailedException: Requests to the Embeddings_Create Operation under Azure OpenAI API version 2023-09-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.

Status: 429 (Too Many Requests)

ErrorCode: 429



Content:

{"error":{"code":"429","message": "Requests to the Embeddings_Create Operation under Azure OpenAI API version 2023-09-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit."}}



Headers:

x-rate-limit-reset-tokens: REDACTED

x-ms-client-request-id: 6286698b-1edb-47b5-a2f9-fdf0a1e53ed7

apim-request-id: REDACTED

Strict-Transport-Security: REDACTED

X-Content-Type-Options: REDACTED

policy-id: REDACTED

x-ms-region: REDACTED

x-ratelimit-remaining-requests: REDACTED

Date: Wed, 06 Dec 2023 12:49:31 GMT

Content-Length: 312

Content-Type: application/json



   at Azure.Core.HttpPipelineExtensions.ProcessMessageAsync(HttpPipeline pipeline, HttpMessage message, RequestContext requestContext, CancellationToken cancellationToken)

   at Azure.AI.OpenAI.OpenAIClient.GetEmbeddingsAsync(EmbeddingsOptions embeddingsOptions, CancellationToken cancellationToken)

   at Submission#14.<<Initialize>>d__0.MoveNext()

--- End of stack trace from previous location ---

   at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)
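The output above shows the embeddings call failing with status 429 (Too Many Requests) once the rate limit is exceeded. One common way to cope, sketched here as an assumption rather than anything the notebook does, is to retry the failing call with an increasing delay. The Retry helper and its parameters are hypothetical, and the failure is simulated with an InvalidOperationException so the sketch runs without the Azure SDK.

```csharp
using System;
using System.Threading.Tasks;

var attempts = 0;

// Simulated operation: fails twice, then succeeds.
var value = await Retry.WithBackoffAsync(
    async () =>
    {
        attempts++;
        await Task.Yield();
        if (attempts < 3) throw new InvalidOperationException("simulated 429");
        return 42;
    },
    // With the Azure SDK the check could instead be (assumption):
    // ex => ex is Azure.RequestFailedException { Status: 429 }
    isRetryable: ex => ex is InvalidOperationException,
    maxAttempts: 5,
    initialDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"Succeeded after {attempts} attempts with value {value}");

public static class Retry
{
    // Runs the operation, retrying retryable failures with a delay that
    // doubles after each attempt, up to maxAttempts attempts in total.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isRetryable,
        int maxAttempts,
        TimeSpan initialDelay)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isRetryable(ex))
            {
                // Wait, then double the delay for the next attempt.
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

When the last attempt also fails, the exception filter no longer matches and the original exception propagates to the caller.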

The following cell splits the issuesWithEmbeddings collection into two separate lists based on the number of labels each issue has.

The first line of the code is creating a new list named noLabels. This list is populated with the issues from the issuesWithEmbeddings collection that have no labels. This is determined by the lambda expression i => i.Issue.Labels.Count == 0 in the Where method, which checks if the Labels property of the Issue object has a Count of 0.

The second line of the code is creating another list named labelled. This list is populated with the issues from the issuesWithEmbeddings collection that have one or more labels. This is determined by the lambda expression i => i.Issue.Labels.Count > 0 in the Where method, which checks if the Labels property of the Issue object has a Count greater than 0.

In both cases, the ToList method is used to convert the filtered enumerable collections to lists.
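A minimal, self-contained version of this split is sketched below. SimpleIssue and the sample data are illustrative stand-ins for the Octokit Issue type and the real issues, and IssueWithEmbedding is redefined here in simplified form, so only the two Where expressions mirror the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data standing in for issues that already carry embeddings.
var issuesWithEmbeddings = new List<IssueWithEmbedding>
{
    new IssueWithEmbedding(new SimpleIssue("Crash on startup", new[] { "bug" }), new[] { 0.1f, 0.9f }),
    new IssueWithEmbedding(new SimpleIssue("Add dark mode", Array.Empty<string>()), new[] { 0.8f, 0.2f }),
    new IssueWithEmbedding(new SimpleIssue("Typo in README", new[] { "docs", "good first issue" }), new[] { 0.5f, 0.5f }),
};

// Issues without any label: the ones that need suggestions.
var noLabels = issuesWithEmbeddings.Where(i => i.Issue.Labels.Count == 0).ToList();

// Issues with at least one label: the reference set.
var labelled = issuesWithEmbeddings.Where(i => i.Issue.Labels.Count > 0).ToList();

Console.WriteLine($"{noLabels.Count} unlabelled, {labelled.Count} labelled");

// Simplified stand-ins for the notebook's types.
public record SimpleIssue(string Title, IReadOnlyList<string> Labels);
public record IssueWithEmbedding(SimpleIssue Issue, float[] Embedding);
```

The two predicates are complementary, so every issue lands in exactly one of the two lists.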

[17]

[18]

[19]

[20]

[21]

[22]

Then we suggest labels for GitHub issues based on their embeddings.

The code starts by creating a new dictionary named suggestions. The keys in this dictionary are instances of IssueWithEmbedding and the values are arrays of LabelWithEmbeddings.

Next, the code enters a foreach loop that iterates over each issue in the noLabels list. For each issue, the code calculates the similarity between the issue's embedding and the embeddings of all labels using the ScoreBySimilarityTo method. This method likely calculates the cosine similarity, a measure of similarity between two non-zero vectors, between the issue's embedding and each label's embedding. The CosineSimilarityComparer<float[]>(f => f) is used to specify how to calculate the cosine similarity.

The resulting scores are then sorted in descending order, filtered to keep only scores greater than 0.85, and limited to the first 5. In other words, the code suggests at most 5 labels per issue, and only labels whose similarity to the issue's embedding exceeds 0.85. An issue with no label above the threshold gets an empty array of suggestions.

Finally, the issue and its suggested labels are added to the suggestions dictionary. The Select(s => s.Key).ToArray() part of the code is used to extract the labels (which are the keys in the score dictionary) and convert them to an array.
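The scoring and selection steps can be illustrated with a small sketch. The notebook relies on ScoreBySimilarityTo and CosineSimilarityComparer, whose internals are not shown, so this version computes cosine similarity by hand. LabelVector, LabelSuggester, and the sample vectors are hypothetical, and the vectors are assumed to be non-zero and of equal length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample label embeddings (three dimensions for readability).
var labels = new[]
{
    new LabelVector("bug", new[] { 1f, 0f, 0f }),
    new LabelVector("docs", new[] { 0f, 1f, 0f }),
    new LabelVector("perf", new[] { 0.9f, 0.3f, 0f }),
};

// Embedding of an unlabelled issue.
var issueEmbedding = new[] { 1f, 0.05f, 0f };

var suggested = LabelSuggester.Suggest(issueEmbedding, labels, threshold: 0.85, maxSuggestions: 5);
Console.WriteLine(string.Join(", ", suggested.Select(l => l.Name))); // prints "bug, perf"

public record LabelVector(string Name, float[] Embedding);

public static class LabelSuggester
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), for non-zero vectors.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Score every label, sort descending, keep scores above the threshold,
    // and return at most maxSuggestions labels.
    public static LabelVector[] Suggest(
        float[] issueEmbedding,
        IEnumerable<LabelVector> labels,
        double threshold,
        int maxSuggestions)
    {
        return labels
            .Select(l => (Label: l, Score: CosineSimilarity(issueEmbedding, l.Embedding)))
            .OrderByDescending(s => s.Score)
            .Where(s => s.Score > threshold)
            .Take(maxSuggestions)
            .Select(s => s.Label)
            .ToArray();
    }
}
```

In the sample data, "bug" and "perf" score roughly 0.999 and 0.963 and are kept, while "docs" scores about 0.05 and is dropped by the threshold. The threshold is applied before Take, so a low-scoring label is never used to pad the list up to five.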

[23]