Getting Started with BigQuery Vector Search and OpenAI
GCP BigQuery with GCP Functions and GPT Actions in ChatGPT
This notebook provides step-by-step instructions for using Google Cloud BigQuery as a vector database with OpenAI embeddings, then creating a Google Cloud Function on top of it to plug into a Custom GPT in ChatGPT.
This can be a solution for customers looking to set up RAG infrastructure contained within Google Cloud Platform (GCP) and expose it as an endpoint to integrate with other platforms such as ChatGPT.
Google Cloud BigQuery is a fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure. It allows developers to store and analyze massive datasets with ease.
Google Cloud Functions is a lightweight, event-based, asynchronous compute solution that allows you to create small, single-purpose functions that respond to cloud events without managing servers or runtime environments.
Prerequisites:
To run this cookbook, you must have:
- A GCP project you have access to
- GCP user with permission to create a BigQuery dataset and Google Cloud Function
- The gcloud CLI installed and authenticated
- OpenAI API key
- A ChatGPT Plus, Team, or Enterprise subscription
Architecture
Below is a diagram of the architecture of this solution, which we'll walk through step-by-step:

Table of Contents
- Set up environment: Install and import the required libraries and configure our GCP settings.
- Prepare data: Prepare the data for upload by embedding the documents and capturing additional metadata. We will use a subset of OpenAI's docs as example data.
- Create BigQuery table with vector search: Create a BigQuery table and upload the data we've prepared. Includes:
  - Create dataset: Steps to create a dataset in BigQuery.
  - Create table and upload data: Instructions to create a table in BigQuery and insert the embedded data.
- Create GCP Function: Deploy a Google Cloud Function using the gcloud CLI and the environment variables computed previously.
- Input in a Custom GPT in ChatGPT: Plug the function into a Custom GPT and perform searches on the embedded data in BigQuery:
  - Vector search: Steps to perform vector-based search queries.
  - Metadata filtering search: Instructions for performing search with metadata filtering.
Set up environment
Install and import required libraries
The libraries below can be categorized as standard Python libraries, third-party libraries, and GCP-related libraries.
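Below is a representative sketch of those imports; the exact set depends on your environment, and it assumes the `openai`, `tiktoken`, `pandas`, `numpy`, and `google-cloud-bigquery` packages are installed.

```python
# If needed, install the third-party dependencies first:
# !pip install openai tiktoken pandas numpy google-cloud-bigquery pyperclip

# Standard Python libraries
import ast
import concurrent.futures
import os
import uuid
from itertools import islice
from pathlib import Path

# Third-party libraries
import numpy as np
import pandas as pd
import tiktoken
from openai import OpenAI

# GCP-related libraries
from google.cloud import bigquery
```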
Configure GCP project
If not already set up, we'll install the gcloud CLI, authenticate to GCP, and set your default project.
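A minimal sketch using notebook-style shell commands, assuming the gcloud CLI is already installed (see Google's installation docs otherwise); the project ID below is a placeholder.

```python
project_id = "my-gcp-project"  # placeholder: replace with your GCP project ID

# Authenticate the gcloud CLI and the client libraries, then set the default project.
!gcloud auth login
!gcloud auth application-default login
!gcloud config set project {project_id}
```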
Configure OpenAI settings
This section guides you through setting up authentication for OpenAI. Before going through this section, make sure you have your OpenAI API key.
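For example, assuming the key is exposed via the `OPENAI_API_KEY` environment variable, and using `text-embedding-3-small` as the embedding model:

```python
# Create the OpenAI client; this raises a KeyError if the API key isn't set.
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Embedding model used throughout this notebook.
embeddings_model = "text-embedding-3-small"
```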
Configure GCP BigQuery with Vector Search capabilities
This section explains how to create a dataset in BigQuery and store arrays of floats, used for embeddings and vector search.
Prepare data
We're going to embed and store a few pages of the OpenAI docs in the oai_docs folder. We'll first embed each page, add it to a CSV, and then use that CSV to upload into our BigQuery table.
We are going to use some techniques highlighted in this cookbook. This is a quick way to embed text, without taking into account factors like section boundaries, using a vision model to describe images/graphs/diagrams, overlapping text between chunks for longer documents, etc.
In order to handle text files longer than the embedding model's 8191-token context limit, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).
We will take a function from Python's own cookbook that breaks up a sequence into chunks.
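Here is that recipe, the `batched` function from the itertools documentation:

```python
def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ('A','B','C') ('D','E','F') ('G',)
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```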
Now we define a function that encodes a string into tokens and then breaks it up into chunks. We'll use tiktoken, a fast open-source tokenizer by OpenAI.
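A sketch of that function; `cl100k_base` is the encoding used by the `text-embedding-3` models:

```python
def chunked_tokens(text, encoding_name="cl100k_base", chunk_length=8191):
    # Encode the string into tokens, then split the token list into
    # chunks of at most chunk_length tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    yield from batched(tokens, chunk_length)
```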
To read more about counting tokens with tiktoken, check out this cookbook.
Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The average flag can be set to True to return the weighted average of the chunk embeddings, or False to simply return the unmodified list of chunk embeddings.
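A sketch of such a function, relying on the helpers defined above:

```python
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = "cl100k_base"

def len_safe_get_embedding(text, model=embeddings_model,
                           max_tokens=EMBEDDING_CTX_LENGTH,
                           encoding_name=EMBEDDING_ENCODING, average=True):
    chunk_embeddings = []
    chunk_lens = []
    for chunk in chunked_tokens(text, encoding_name=encoding_name,
                                chunk_length=max_tokens):
        # The embeddings endpoint accepts token arrays as well as strings.
        response = openai_client.embeddings.create(input=list(chunk), model=model)
        chunk_embeddings.append(response.data[0].embedding)
        chunk_lens.append(len(chunk))

    if average:
        # Weighted average of the chunk embeddings, re-normalized to unit length.
        averaged = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        return (averaged / np.linalg.norm(averaged)).tolist()
    return chunk_embeddings
```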
Note: there are other techniques you can take here, including:
- using GPT-4o to capture images/chart descriptions for embedding
- chunking based on paragraphs or sections
- adding more descriptive metadata about each article.
Next, we can define a helper function that will capture additional metadata about the documents. In this example, we'll choose from a list of categories to use later on in a metadata filter.
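A minimal sketch, assuming an illustrative category list and gpt-4o as the classifier (both choices are assumptions you can swap out):

```python
categories = ["authentication", "models", "techniques", "tools",
              "setup", "billing_limits", "other"]  # illustrative list

def categorize_text(text, categories=categories):
    # Ask gpt-4o to assign exactly one category; used later for metadata filtering.
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Categorize the document into exactly one of these "
                        f"categories: {categories}. Reply with the category name only."},
            {"role": "user", "content": text[:4000]},  # a prefix is usually enough
        ],
    )
    return response.choices[0].message.content.strip()
```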
Now, we can define some helper functions to process the .txt files in the oai_docs folder. Feel free to use this on your own data; it supports both .txt and .pdf files.
We'll now use this helper function to process our OpenAI documentation. Feel free to update this to use your own data by changing the folder in process_files below.
Note that this will process the documents in the chosen folder concurrently, so it should take under 30 seconds with .txt files, and slightly longer with PDFs.
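A sketch that handles .txt files only (PDF support would need an additional parser such as pypdf); the six-column layout — id, vector_id, title, text, category, content_vector — is an assumed schema:

```python
def process_file(path):
    # Read a document, categorize it, and embed its content.
    text = Path(path).read_text()
    return {
        "id": str(uuid.uuid4()),
        "vector_id": str(uuid.uuid4()),
        "title": Path(path).stem,
        "text": text,
        "category": categorize_text(text),
        "content_vector": len_safe_get_embedding(text),
    }

def process_files(folder):
    paths = list(Path(folder).glob("*.txt"))
    # Embed and categorize the documents concurrently to keep the runtime short.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        rows = list(executor.map(process_file, paths))
    pd.DataFrame(rows).to_csv("embedded_data.csv", index=False)

process_files("oai_docs")
```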
We now have an embedded_data.csv file with six columns that we can upload to our vector database!
Create BigQuery table with Vector Search
Create BigQuery dataset
We'll leverage the Google Cloud SDK and create a dataset named "oai_docs" with a table named "embedded_data", but feel free to change those variables (you can also change the region).
Note: we won't create a BigQuery vector index, which could improve the performance of the vector search, because such an index requires more than 1,000 rows in the dataset, which we don't have in our example. Feel free to leverage one for your own use case.
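A sketch using the BigQuery Python client; the location and region values are placeholders to adjust:

```python
bq_client = bigquery.Client(project=project_id)

dataset_id = "oai_docs"
table_name = "embedded_data"
region = "us-central1"  # reused later for the Cloud Function; adjust as needed

dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
dataset.location = "US"  # multi-region location; adjust as needed
dataset = bq_client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} ready.")
```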
Create table and upload data
We'll create the table with the attribute names and types. Note the 'content_vector' attribute, which stores an array of floats for a single row and which we'll use for our vector search.
This code will then loop over the CSV we created previously to insert the rows into BigQuery. If you run this code multiple times, identical rows will be inserted multiple times, which will give less accurate search results (you could enforce uniqueness on IDs or clean the table each time).
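A sketch assuming the six-column CSV layout produced above:

```python
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("vector_id", "STRING"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("text", "STRING"),
    bigquery.SchemaField("category", "STRING"),
    # A REPEATED FLOAT64 field stores one embedding vector per row.
    bigquery.SchemaField("content_vector", "FLOAT64", mode="REPEATED"),
]

table = bigquery.Table(f"{project_id}.{dataset_id}.{table_name}", schema=schema)
table = bq_client.create_table(table, exists_ok=True)

df = pd.read_csv("embedded_data.csv")
# The vector column round-trips through CSV as a string; parse it back into a list.
df["content_vector"] = df["content_vector"].apply(ast.literal_eval)

errors = bq_client.insert_rows_json(table, df.to_dict(orient="records"))
print(errors or f"Inserted {len(df)} rows.")
```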
Test search
Now that the data is uploaded, we'll test both pure vector similarity search and search with metadata filtering locally below, to make sure everything works as expected.
The query below is a pure vector search, where we don't filter on category.
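A minimal sketch using BigQuery's VECTOR_SEARCH table function, with the question embedding passed as a query parameter:

```python
question = "What model should I use to embed text?"  # example question
question_embedding = len_safe_get_embedding(question)

query = f"""
SELECT base.id, base.title, base.category, base.text, distance
FROM VECTOR_SEARCH(
  TABLE `{project_id}.{dataset_id}.{table_name}`,
  'content_vector',
  (SELECT @question_embedding AS content_vector),
  top_k => 3,
  distance_type => 'COSINE'
)
ORDER BY distance
"""

job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ArrayQueryParameter("question_embedding", "FLOAT64", question_embedding),
])
for row in bq_client.query(query, job_config=job_config).result():
    print(f"{row.title} ({row.category}) - distance {row.distance:.4f}")
```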
Perform search with metadata filtering
Metadata filtering lets us restrict results to those with certain attributes, on top of returning the closest semantic matches from the vector search.
The provided code snippet demonstrates how to execute a query with metadata filtering:
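A sketch that post-filters the VECTOR_SEARCH results on the category column (for larger tables, pre-filtering the base table can be more efficient):

```python
category_filter = "models"  # illustrative category value

filtered_query = f"""
SELECT base.id, base.title, base.category, base.text, distance
FROM VECTOR_SEARCH(
  TABLE `{project_id}.{dataset_id}.{table_name}`,
  'content_vector',
  (SELECT @question_embedding AS content_vector),
  top_k => 10,
  distance_type => 'COSINE'
)
WHERE base.category = @category
ORDER BY distance
LIMIT 3
"""

job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ArrayQueryParameter("question_embedding", "FLOAT64", question_embedding),
    bigquery.ScalarQueryParameter("category", "STRING", category_filter),
])
for row in bq_client.query(filtered_query, job_config=job_config).result():
    print(f"{row.title} ({row.category}) - distance {row.distance:.4f}")
```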
Create GCP function
Exporting variables
We'll deploy the function in main.py in this folder (also available here).
As a first step, we'll export the variables needed to target our dataset/table and to generate embeddings using OpenAI's API.
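For example (the variable names below are an assumed convention; align them with whatever main.py actually reads):

```python
# Environment variables consumed by the deploy command in the next step;
# subprocesses launched from the notebook inherit them.
os.environ["PROJECT_ID"] = project_id
os.environ["DATASET_ID"] = dataset_id
os.environ["TABLE_NAME"] = table_name
os.environ["REGION"] = region
# OPENAI_API_KEY is assumed to already be set in the environment.
```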
Deploying the function
We will now create a Google Cloud Function called "openai_docs_search" in our current project. To do that, we'll run the CLI command below, leveraging the previously created environment variables. Note that this function can be called from anywhere without authentication; do not use it as-is in production, or add an additional authentication mechanism.
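A sketch of that command as a notebook-style shell call; the entry-point name is assumed to match the function defined in main.py, and the runtime and region are placeholders:

```python
# --allow-unauthenticated makes the endpoint public: convenient for this
# demo, but unsafe for production.
!gcloud functions deploy openai_docs_search \
    --gen2 \
    --runtime python311 \
    --region $REGION \
    --source . \
    --entry-point openai_docs_search \
    --trigger-http \
    --allow-unauthenticated \
    --set-env-vars PROJECT_ID=$PROJECT_ID,DATASET_ID=$DATASET_ID,TABLE_NAME=$TABLE_NAME,OPENAI_API_KEY=$OPENAI_API_KEY
```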
Input in a Custom GPT in ChatGPT
Now that we have a GCP Function that performs vector search over our BigQuery table, let's plug it into a Custom GPT as a GPT Action!
See the documentation here on GPTs and here on GPT Actions. Use the content below as the GPT's instructions and as the OpenAPI spec for the GPT Action.
Create OpenAPI Spec
Below is a sample OpenAPI spec. When we run the block below, a functional spec should be copied to the clipboard to paste in the GPT Action.
Note that this does not have any authentication by default, but you can set up GCP Functions with Auth by following GCP's docs here.
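A sketch of that block, assembling a minimal spec and copying it to the clipboard via the `pyperclip` package; the URL pattern and the request/response shape are assumptions to align with what main.py actually accepts:

```python
import json
import pyperclip

# Assumed Cloud Functions URL pattern; replace with the URL printed by the deploy step.
function_url = f"https://{region}-{project_id}.cloudfunctions.net/openai_docs_search"

spec = {
    "openapi": "3.1.0",
    "info": {"title": "OpenAI Docs Search API", "version": "1.0.0"},
    "servers": [{"url": function_url}],
    "paths": {
        "/": {
            "post": {
                "operationId": "openai_docs_search",
                "summary": "Vector search over the embedded OpenAI docs",
                "requestBody": {
                    "required": True,
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {
                                    "query": {"type": "string"},
                                    "category": {"type": "string"},
                                },
                                "required": ["query"],
                            }
                        }
                    },
                },
                "responses": {"200": {"description": "Matching documents"}},
            }
        }
    },
}

pyperclip.copy(json.dumps(spec, indent=2))
print("OpenAPI spec copied to clipboard - paste it into the GPT Action.")
```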
Create GPT Instructions
Feel free to modify instructions as you see fit. Check out our docs here for some tips on prompt engineering.
Recap
We've now successfully integrated GCP BigQuery vector search with GPT Actions in ChatGPT by doing the following:
- Embedded docs using OpenAI's embeddings, while adding some additional metadata using gpt-4o.
- Uploaded that data to GCP BigQuery (both the raw data and the embedding vectors).
- Created an endpoint on GCP Functions to retrieve it.
- Incorporated it into a Custom GPT.
Our GPT can now retrieve information to help answer user queries, making it much more accurate and customized to our data. Here's the GPT in action:
