Notebooks
O
OpenAI
Visualizing Embeddings In Kangas

Visualizing Embeddings In Kangas

chatgptopenaigpt-4third_partyexamplesopenai-apiopenai-cookbook

Visualizing the embeddings in Kangas

In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions.

What is Kangas?

Kangas as an open source, mixed-media, dataframe-like tool for data scientists. It was developed by Comet, a company designed to help reduce the friction of moving models into production.

1. Setup

To get started, we pip install kangas, and import it.

[1]
[2]

2. Constructing a Kangas DataGrid

We create a Kangas Datagrid with the original data and the embeddings. The data is composed of a rows of reviews, and the embeddings are composed of 1536 floating-point values. In this example, we get the data directly from github, in case you aren't running this notebook inside OpenAI's repo.

We use Kangas to read the CSV file into a DataGrid for further processing.

[3]
Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...
1001it [00:00, 2412.90it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]

We can review the fields of the CSV file:

[4]
DataGrid (in memory)
    Name   : fine_food_reviews_with_embeddings_1k
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 INTEGER             
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 TEXT                

And get a glimpse of the first and last rows:

[5]

Now, we create a new DataGrid, converting the numbers into an Embedding:

[8]

The new DataGrid now has an Embedding column with proper datatype.

[9]
DataGrid (in memory)
    Name   : openai_embeddings
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 TEXT                
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 EMBEDDING-ASSET     

We simply save the datagrid, and we're done.

[ ]

3. Render 2D Projections

To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.

Scroll to far right to see embeddings projection per row.

The color of the point in projection space represents the Score.

[11]

Group by "Score" to see rows of each group.

[22]