Snowflake End To End ML With Feature Store And Model Registry

Required snowflake-ml-python version: >=1.6.1
Last updated on: 8/26/2024

End to end ML with Feature Store and Model Registry

This notebook demonstrates an end-to-end ML experiment cycle including feature creation, training data generation, model training and inference. The workflow touches on key Snowflake ML features including Snowflake Feature Store, Dataset, ML Lineage, Snowpark ML Modeling and Snowflake Model Registry.

Note: there may be a delay in the availability of the newest snowflake-ml-python package in the Snowflake Conda channel. To install the latest snowflake-ml-python package, which includes all of the necessary components used in this notebook, please follow the official install instructions.

Table of contents

Set up test environment
- Connect to Snowflake
- Select your example
Create features with Feature Store
Generate Training Data
Train model with Snowpark ML
Log model in Model Registry
- Examine model in Snowflake UI
Query lineage (Preview Feature)
Predict with model
- Predict with local model
- Predict with Model Registry
Clean up notebook

Set up test environment

Connect to Snowflake

Let's start by setting up our test environment. We will create a session and a schema. The schema FS_DEMO_SCHEMA (shown as SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO in the output below) will be used as the Feature Store. It will be cleaned up at the end of the demo. You need to fill in connection_parameters with your Snowflake connection information. See the Snowflake connection guide for more details about how to connect to Snowflake.
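As a minimal sketch of what filling in connection_parameters can look like, the block below defines the dictionary with placeholder values and a small check that refuses to connect while placeholders remain. The key names are the ones commonly used with Snowpark, the `<...>` values and the helper names `unfilled_keys` and `create_session` are assumptions made for illustration, and the actual notebook cell may look different.

```python
from typing import Dict, List

# Placeholder values only: every "<...>" entry must be replaced with your own details.
connection_parameters: Dict[str, str] = {
    "account": "<your_account_identifier>",
    "user": "<your_user>",
    "password": "<your_password>",
    "role": "<your_role>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}


def unfilled_keys(params: Dict[str, str]) -> List[str]:
    """Return the keys whose values are empty or still look like <placeholders>."""
    return sorted(
        key
        for key, value in params.items()
        if not value or (value.startswith("<") and value.endswith(">"))
    )


def create_session(params: Dict[str, str]):
    """Open a Snowpark session; refuses to run while placeholders remain."""
    missing = unfilled_keys(params)
    if missing:
        raise ValueError(f"Fill in these connection parameters first: {missing}")
    # Imported lazily so the check above works even without Snowpark installed.
    from snowflake.snowpark import Session

    return Session.builder.configs(params).create()
```

Calling `create_session(connection_parameters)` with the placeholders still in place raises a ValueError that lists the keys to fill in, which is usually easier to read than a failed login.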

[1]

[2]

[Row(status='Schema SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO successfully created.')]

Select your example

We have prepared some examples that you can find in our open-source repo. Each example contains the source dataset, feature view and entity definitions that will be used in this demo. ExampleHelper (included in snowflake-ml-python) will set up everything with simple APIs, so you don't have to worry about the details.

[3]

load_example() will load the source data into Snowflake tables. In the example below, we are using the “new_york_taxi_features” example. You can replace this with any example listed above. Execution of the cell below may take some time depending on the size of the dataset.

[4]

"AIRLINE_FEATURE_STORE".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.nyc_yellow_trips:

Create features with Feature Store

Initialize Feature Store

Let's first create a feature store client. With CREATE_IF_NOT_EXIST mode, it will try to create a new Feature Store schema and all necessary feature store metadata if they don't already exist. This mode is required the first time you set up a Feature Store. Afterwards, you can use FAIL_IF_NOT_EXIST mode to connect to an existing Feature Store.

Note that the database being used must already exist. Feature Store will NOT try to create the database even in CREATE_IF_NOT_EXIST mode.
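The difference between the two modes can be sketched with a toy model in plain Python. This is not the snowflake-ml-python implementation: `ToyCreationMode`, `open_feature_store` and the in-memory `catalog` are hypothetical names that only mirror the behavior described above, including the rule that the database is never created.

```python
from enum import Enum
from typing import Dict, Set


class ToyCreationMode(Enum):
    """Stand-in for the two modes named in the text (not the library's enum)."""

    FAIL_IF_NOT_EXIST = "fail"
    CREATE_IF_NOT_EXIST = "create"


def open_feature_store(
    catalog: Dict[str, Set[str]], database: str, schema: str, mode: ToyCreationMode
) -> str:
    """Return the qualified schema name, creating the schema only if allowed."""
    if database not in catalog:
        # The database is never created, whichever mode is used.
        raise LookupError(f"database {database!r} does not exist")
    if schema not in catalog[database]:
        if mode is ToyCreationMode.FAIL_IF_NOT_EXIST:
            raise LookupError(f"feature store schema {schema!r} does not exist")
        catalog[database].add(schema)  # first-time setup
    return f"{database}.{schema}"


# Hypothetical catalog: one existing database with no schemas yet.
catalog = {"MY_DB": set()}
first = open_feature_store(catalog, "MY_DB", "FS_DEMO", ToyCreationMode.CREATE_IF_NOT_EXIST)
again = open_feature_store(catalog, "MY_DB", "FS_DEMO", ToyCreationMode.FAIL_IF_NOT_EXIST)
```

The first call performs the one-time setup; the second call succeeds in the stricter mode only because the schema now exists.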

[5]

Register entities and feature views

Next, we register new entities and feature views in the Feature Store. Entities are the join keys used to generate training data. Feature views contain all the features you need for model training and inference. The entities and feature views for this example are defined in our open-source repo. We will load the definitions with load_entities() and load_draft_feature_views() for simplicity.

[6]

----------------------------------------------------------------------
|"NAME"        |"JOIN_KEYS"       |"DESC"                 |"OWNER"   |
----------------------------------------------------------------------
|DOLOCATIONID  |["DOLOCATIONID"]  |Drop off location id.  |ENGINEER  |
|TRIP_ID       |["TRIP_ID"]       |Trip id.               |ENGINEER  |
----------------------------------------------------------------------

[7]

------------------------------------------------------------------------------------------------
|"NAME"      |"VERSION"  |"DESC"                                              |"REFRESH_FREQ"  |
------------------------------------------------------------------------------------------------
|F_LOCATION  |1.0        |Features aggregated by location id and refreshe...  |12 hours        |
|F_TRIP      |1.0        |Features per trip refreshed every day.              |1 day           |
------------------------------------------------------------------------------------------------

We can examine all features in a feature view.

[8]

F_LOCATION/1.0 has features:

F_TRIP/1.0 has features:

Examine features in Snowflake UI

Now you should be able to see the registered entities and feature views in the Snowflake UI.

Generate Training Data

After our feature pipelines are fully set up, we can use them to generate a Snowflake Dataset and later do model training. Generating training data is easy since materialized FeatureViews already carry most of the metadata, such as join keys and the timestamp for point-in-time lookup. We just need to provide the spine data (it's called the spine because it is the list of entity IDs that we are enriching by joining features to it).
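To make the spine and the point-in-time lookup concrete, here is a minimal sketch in plain Python. It assumes "at or before" semantics (each spine row receives the most recent feature value whose timestamp is not later than the row's own timestamp). The helper names and the feature values are invented for illustration; only the entity id 236 and the first spine timestamp come from the sample data in this notebook. The Feature Store performs the equivalent join inside Snowflake.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# entity id -> feature rows sorted by timestamp: (refresh time, feature value)
FeatureHistory = Dict[int, List[Tuple[datetime, float]]]


def value_as_of(history: FeatureHistory, entity_id: int, as_of: datetime) -> Optional[float]:
    """Return the latest feature value recorded at or before `as_of`, if any."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)  # number of rows with ts <= as_of
    return rows[idx - 1][1] if idx > 0 else None


def enrich_spine(
    spine: List[Tuple[int, datetime]], history: FeatureHistory
) -> List[Tuple[int, datetime, Optional[float]]]:
    """Join each (entity id, timestamp) spine row with its point-in-time value."""
    return [(eid, ts, value_as_of(history, eid, ts)) for eid, ts in spine]


# Illustrative history for one drop-off location; the values are invented.
history: FeatureHistory = {
    236: [
        (datetime(2016, 1, 13, 12, 0), 9.5),
        (datetime(2016, 1, 14, 0, 0), 10.2),
    ]
}
# Spine: entity ids plus the event time of each record to enrich.
spine = [
    (236, datetime(2016, 1, 13, 15, 28, 31)),
    (236, datetime(2016, 1, 12, 8, 0, 0)),
]
enriched = enrich_spine(spine, history)
```

The first spine row picks up the value refreshed at noon on January 13 rather than the later one, which is what prevents future information from leaking into training data. The second row predates every refresh and gets no value.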

generate_dataset() returns a Snowflake Dataset object, which is best for distributed training with deep learning frameworks like TensorFlow or PyTorch, which require fine-grained file-level access. It creates a new Dataset object (which is versioned and immutable) in Snowflake and materializes the data in Parquet files. If you train models with classic ML libraries like Snowpark ML or scikit-learn, you can use generate_training_set(), which returns a classic Snowflake table. The cell below demonstrates generate_dataset().

Retrieve some metadata columns that are essential when generating training data.

[9]

timestamp col: TPEP_PICKUP_DATETIME
excluded cols: []
label cols: ['TOTAL_AMOUNT']
join keys: ['TRIP_ID', 'DOLOCATIONID']
training spine table: "AIRLINE_FEATURE_STORE".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.nyc_yellow_trips

Create a spine dataframe that's sampled from the source table.

[10]

------------------------------------------------------------------------
|"TOTAL_AMOUNT"  |"TRIP_ID"  |"DOLOCATIONID"  |"TPEP_PICKUP_DATETIME"  |
------------------------------------------------------------------------
|11.8            |4391772    |236             |2016-01-13 15:28:31     |
|6.8             |9640580    |231             |2016-01-28 21:47:03     |
|10.3            |8986296    |162             |2016-01-27 06:44:50     |
|20.35           |4689446    |261             |2016-01-14 09:29:27     |
|19.89           |9360850    |166             |2016-01-28 07:33:07     |
|6.3             |9335036    |211             |2016-01-28 04:46:46     |
|72.92           |5223446    |264             |2016-01-15 17:21:27     |
|16.3            |4578405    |116             |2016-01-13 23:35:00     |
|7.3             |5045083    |163             |2016-01-15 07:10:06     |
|10.3            |9733135    |145             |2016-01-29 05:14:06     |
------------------------------------------------------------------------

Generate dataset object from spine dataframe and feature views.

[11]

Convert the dataset to a Snowpark DataFrame and examine all the features in it.

[12]

Train model with Snowpark ML

Now let's train a simple random forest model and evaluate its prediction accuracy. When you call fit() on a DataFrame that is created from a Dataset, the linkage between the trained model and the dataset is wired up automatically. Later, you can retrieve the training dataset from this model, or query the lineage between the dataset and the model. Note that this lineage support was a work in progress when the notebook was written (it appears as a preview feature in the lineage section).

[13]

feature cols: ['TRIP_DISTANCE', 'FARE_AMOUNT', 'AVG_FARE_10H', 'PASSENGER_COUNT', 'AVG_FARE_1H']
MSE: 8.587654420611477, Accuracy: 99.83667856616516
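The MSE reported above is the mean of the squared differences between the actual and predicted TOTAL_AMOUNT. As a minimal sketch in plain Python (not necessarily the metric call the notebook cell uses), the function below computes it for the three sample rows that appear in the local-prediction output later in this notebook. Those three rows are not the evaluation set, so the result will not match the figure above.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# TOTAL_AMOUNT and OUTPUT_TOTAL_AMOUNT from the three sample prediction rows.
actual = [15.96, 10.55, 17.80]
predicted = [16.440312, 8.523669, 18.717726]

mse = mean_squared_error(actual, predicted)
print(f"MSE on the three sample rows: {mse:.4f}")
```

Because errors are squared, the second row (off by about 2.03) contributes far more to the total than the other two rows combined.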

Log model in Model Registry

After the model is trained, we can save it into the Model Registry so we can manage the model and its metadata (including metrics and versions) and use it later for inference. ML lineage is also built automatically between the model, the dataset and the feature views.

[14]

Log model into Model Registry.

[15]

/Users/wezhou/miniconda3/envs/py38/lib/python3.8/contextlib.py:113: UserWarning: `relax_version` is not set and therefore defaulted to True. Dependency version constraints relaxed from ==x.y.z to >=x.y, <(x+1). To use specific dependency versions for compatibility, reproducibility, etc., set `options={'relax_version': False}` when logging the model.
  return next(self.gen)

ModelVersion(
  name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
  version='V1',
)
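The UserWarning above says that, with relax_version defaulted to True, a dependency pinned as ==x.y.z is relaxed to >=x.y, <(x+1). The sketch below applies that stated rule to a requirement string so the effect is easy to see. It is an illustration only, not the library's own implementation, and the helper name `relax_pin` and the example package pins are assumptions.

```python
import re

# Matches an exact pin such as "scikit-learn==1.3.0" (name==major.minor[.patch...]).
_PIN = re.compile(r"^(?P<name>[A-Za-z0-9_.\-]+)==(?P<major>\d+)\.(?P<minor>\d+)(?:\.\d+)*$")


def relax_pin(requirement: str) -> str:
    """Rewrite name==x.y.z as name>=x.y,<(x+1); leave anything else unchanged."""
    match = _PIN.match(requirement.strip())
    if match is None:
        return requirement
    major = int(match["major"])
    return f"{match['name']}>={major}.{match['minor']},<{major + 1}"


# Hypothetical pins, chosen only to show the transformation.
relaxed = [relax_pin(r) for r in ["scikit-learn==1.3.0", "numpy>=1.20"]]
```

Here the exact pin becomes a range that accepts any 1.x release from 1.3 onward, while the requirement that was not an exact pin passes through untouched. Setting options={'relax_version': False} when logging the model, as the warning suggests, keeps the exact pins instead.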

Examine model in Snowflake UI

Now you should be able to see the model in the Snowflake UI.

Query lineage (Preview Feature)

We can now query the lineage of an object. You can call lineage() on any object, and it returns the set of objects it has a dependency relationship with.

[16]

WARNING:snowflake.snowpark:LineageNode.lineage() is in private preview since 1.5.3. Do not use it in production. 
WARNING:snowflake.snowpark:Lineage.trace() is in private preview since 1.16.0. Do not use it in production.

[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]

[17]

[ModelVersion(
  name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
  version='V1',
)]

[18]

[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]
[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]
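The outputs above can be read as walks over a dependency graph: feature views feed the dataset, and the dataset feeds the model. The toy class below sketches that idea in plain Python. It is not the Snowflake lineage API; `LineageGraph`, its methods and the `direction` argument are hypothetical, and the node labels simply reuse object names from this notebook.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class LineageGraph:
    """Toy dependency graph with one-hop upstream and downstream lookups."""

    def __init__(self) -> None:
        self._downstream: DefaultDict[str, Set[str]] = defaultdict(set)
        self._upstream: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        """Record that `target` was produced from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def lineage(self, node: str, direction: str = "downstream") -> List[str]:
        """Return the direct neighbours of `node` in the given direction."""
        if direction not in ("downstream", "upstream"):
            raise ValueError("direction must be 'downstream' or 'upstream'")
        edges = self._downstream if direction == "downstream" else self._upstream
        return sorted(edges.get(node, set()))


graph = LineageGraph()
dataset = "MY_COOL_TRAINING_DATASET/4.0"
model = "MY_RANDOM_FOREST_REGRESSOR_MODEL/V1"
for feature_view in ("F_LOCATION/1.0", "F_TRIP/1.0"):
    graph.add_edge(feature_view, dataset)
graph.add_edge(dataset, model)

print(graph.lineage(model, direction="upstream"))
print(graph.lineage(dataset))
```

Asking for the model's upstream neighbours returns the dataset, and asking for the dataset's downstream neighbours returns the model, which mirrors the two kinds of result shown in the cell outputs.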

Predict with model

We are now almost ready for prediction. For this, we look up the latest feature values from the Feature Store for the specific data records that we are running predictions on. One of the key benefits of using the Feature Store is that it automatically serves point-in-time correct feature values at prediction time. load_feature_views_from_dataset() gets the same feature views used in training, then retrieve_feature_values() looks up the latest feature values.

[19]

[Option 1] Predict with local model

Now we can predict with a local model and the feature values retrieved from the Feature Store.

[20]

   TOTAL_AMOUNT TPEP_PICKUP_DATETIME  AVG_FARE_1H  AVG_FARE_10H  \
0         15.96  2016-01-07 10:26:02     9.440415      9.324965   
1         10.55  2016-01-01 18:44:40    10.083333      9.236685   
2         17.80  2016-01-29 21:05:54    10.385390     10.287410   

   PASSENGER_COUNT  TRIP_DISTANCE  FARE_AMOUNT  OUTPUT_TOTAL_AMOUNT  
0                1           2.23         12.5            16.440312  
1                1           1.70          7.0             8.523669  
2                1           3.11         16.5            18.717726

[Option 2] Predict with Model Registry

We can also retrieve the model from the Model Registry and run predictions with it using the latest feature values.

[21]

   TRIP_DISTANCE  FARE_AMOUNT  AVG_FARE_10H  PASSENGER_COUNT  AVG_FARE_1H  \
0           2.23         12.5      9.324965                1     9.440415   
1           1.70          7.0      9.236685                1    10.083333   
2           3.11         16.5     10.287410                1    10.385390   

   OUTPUT_TOTAL_AMOUNT  
0            16.440312  
1             8.523669  
2            18.717726

Clean up notebook

This cell drops the schemas that were created at the beginning of this notebook, along with all objects that live in those schemas, including source data tables, feature views, datasets and models.

[22]

[Row(status='SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO_MODEL successfully dropped.')]