End To End ML With Feature Store And Model Registry
- Required snowflake-ml-python version: >=1.6.1
- Last updated on: 8/26/2024
This notebook demonstrates an end-to-end ML experiment cycle including feature creation, training data generation, model training and inference. The workflow touches on key Snowflake ML features including Snowflake Feature Store, Dataset, ML Lineage, Snowpark ML Modeling and Snowflake Model Registry.
Note: there may be a delay before the newest snowflake-ml-python package becomes available in the Snowflake Conda channel. To install the latest snowflake-ml-python package, which includes all the components used in this notebook, please follow the install instructions here.
Set up test environment
Connect to Snowflake
Let's start by setting up our test environment. We will create a session and a schema. The schema FS_DEMO_SCHEMA will be used as the Feature Store and will be cleaned up at the end of the demo. You need to fill in connection_parameters with your Snowflake connection information. Follow this guide for more details about how to connect to Snowflake.
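The connection cell itself is not reproduced here. The following is a minimal sketch, assuming credentials are supplied through environment variables; the SNOWFLAKE_* variable names and both helper names are hypothetical and not taken from the notebook. It uses CREATE SCHEMA IF NOT EXISTS as a non-destructive assumption; the notebook may create the schema differently.

```python
import os


def build_connection_parameters(env=os.environ):
    """Collect Snowpark connection settings from a mapping.

    The SNOWFLAKE_* variable names are an assumption made for this sketch.
    """
    keys = ["account", "user", "password", "role", "warehouse", "database"]
    params = {k: env.get(f"SNOWFLAKE_{k.upper()}") for k in keys}
    # Drop unset entries rather than passing None values to Snowpark.
    return {k: v for k, v in params.items() if v}


def create_session_and_schema(connection_parameters, schema_name):
    """Open a session and create the demo schema.

    Requires snowflake-snowpark-python; the import is deferred so the
    helper above can be used without the package installed.
    """
    from snowflake.snowpark import Session

    session = Session.builder.configs(connection_parameters).create()
    session.sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}").collect()
    return session
```

A call such as create_session_and_schema(build_connection_parameters(), FS_DEMO_SCHEMA) would then produce the session used in the rest of the walkthrough.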
[Row(status='Schema SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO successfully created.')]
Select your example
We have prepared some examples that you can find in our open source repo. Each example contains the source dataset, feature view and entity definitions which will be used in this demo. ExampleHelper (included in snowflake-ml-python) will set up everything with simple APIs, so you don't have to worry about the details.
load_example() will load the source data into Snowflake tables. In the example below, we are using the “new_york_taxi_features” example. You can replace this with any example listed above. Execution of the cell below may take some time depending on the size of the dataset.
"AIRLINE_FEATURE_STORE".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.nyc_yellow_trips:
Initialize Feature Store
Let's first create a feature store client. In CREATE_IF_NOT_EXIST mode, it will try to create a new Feature Store schema and all necessary feature store metadata if they don't already exist. This mode is required the first time you set up a Feature Store. Afterwards, you can use FAIL_IF_NOT_EXIST mode to connect to an existing Feature Store.
Note that the database being used must already exist. Feature Store will NOT try to create the database even in CREATE_IF_NOT_EXIST mode.
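The choice between the two creation modes can be sketched as follows. The helper names and the first_time_setup flag are hypothetical; FeatureStore and CreationMode come from snowflake.ml.feature_store, and the import is deferred so the mode-selection helper can be read and run on its own.

```python
def creation_mode_name(first_time_setup):
    """Pick the creation mode described above (hypothetical helper)."""
    return "CREATE_IF_NOT_EXIST" if first_time_setup else "FAIL_IF_NOT_EXIST"


def open_feature_store(session, database, schema, warehouse, first_time_setup=True):
    """Create a Feature Store client. The database must already exist."""
    from snowflake.ml.feature_store import CreationMode, FeatureStore

    return FeatureStore(
        session=session,
        database=database,
        name=schema,  # the schema that acts as the Feature Store
        default_warehouse=warehouse,
        creation_mode=CreationMode[creation_mode_name(first_time_setup)],
    )
```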
Register entities and feature views
Next we register new entities and feature views in Feature Store. Entities are the join keys used to generate training data. Feature views contain all the features you need for model training and inference. The entities and feature views for this example are defined in our open source repo. We will load the definitions with load_entities() and load_draft_feature_views() for simplicity.
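The registration step can be sketched as a small loop. The function name is hypothetical, fs is assumed to be a FeatureStore client and example_helper an ExampleHelper instance; the version string "1.0" matches the listing shown in the output that follows.

```python
def register_example_definitions(fs, example_helper, version="1.0"):
    """Register every entity and draft feature view supplied by the helper."""
    entities = example_helper.load_entities()
    for entity in entities:
        fs.register_entity(entity)

    registered = []
    for draft_fv in example_helper.load_draft_feature_views():
        # register_feature_view returns the registered feature view object.
        registered.append(
            fs.register_feature_view(feature_view=draft_fv, version=version)
        )
    return entities, registered
```

Afterwards, fs.list_entities().show() and fs.list_feature_views().show() would print listings like the ones below.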
----------------------------------------------------------------------
|"NAME"        |"JOIN_KEYS"       |"DESC"                 |"OWNER"   |
----------------------------------------------------------------------
|DOLOCATIONID  |["DOLOCATIONID"]  |Drop off location id.  |ENGINEER  |
|TRIP_ID       |["TRIP_ID"]       |Trip id.               |ENGINEER  |
----------------------------------------------------------------------

------------------------------------------------------------------------------------------------
|"NAME"      |"VERSION"  |"DESC"                                              |"REFRESH_FREQ"  |
------------------------------------------------------------------------------------------------
|F_LOCATION  |1.0        |Features aggregated by location id and refreshe...  |12 hours        |
|F_TRIP      |1.0        |Features per trip refreshed every day.              |1 day           |
------------------------------------------------------------------------------------------------
We can examine all features in a feature view.
F_LOCATION/1.0 has features:
F_TRIP/1.0 has features:
Generate Training Data
After our feature pipelines are fully set up, we can use them to generate a Snowflake Dataset and later train a model. Generating training data is easy since materialized FeatureViews already carry most of the metadata, such as join keys and the timestamp used for point-in-time lookup. We just need to provide the spine data (it's called a spine because it is the list of entity IDs that we enrich by joining features onto it).
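To make "point-in-time lookup" concrete, here is a pure-Python illustration of the idea: for each spine row, take the most recent feature value recorded at or before the row's timestamp. This is a conceptual sketch only, not how Feature Store implements the join, and the function names are hypothetical. Timestamps are ISO-formatted strings, which sort chronologically.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value whose timestamp is <= as_of, or None.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None


def enrich_spine(spine_rows, history_by_key):
    """Attach a point-in-time-correct feature value to each (key, timestamp) row."""
    return [
        (key, ts, point_in_time_lookup(history_by_key.get(key, []), ts))
        for key, ts in spine_rows
    ]
```

A spine row whose timestamp precedes every recorded feature value gets None, which mirrors the way a point-in-time join avoids leaking future information into training data.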
generate_dataset() returns a Snowflake Dataset object, which is best suited to distributed training with deep learning frameworks such as TensorFlow or PyTorch that require fine-grained file-level access. It creates a new Dataset object (versioned and immutable) in Snowflake and materializes the data in Parquet files. If you train models with classic ML libraries like Snowpark ML or scikit-learn, you can use generate_training_set(), which returns a classic Snowflake table. The cell below demonstrates generate_dataset().
Retrieve some metadata columns that are essential when generating training data.
timestamp col: TPEP_PICKUP_DATETIME
excluded cols: []
label cols: ['TOTAL_AMOUNT']
join keys: ['TRIP_ID', 'DOLOCATIONID']
training spine table: "AIRLINE_FEATURE_STORE".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.nyc_yellow_trips
Create a spine dataframe that's sampled from source table.
------------------------------------------------------------------------
|"TOTAL_AMOUNT"  |"TRIP_ID"  |"DOLOCATIONID"  |"TPEP_PICKUP_DATETIME"  |
------------------------------------------------------------------------
|11.8            |4391772    |236             |2016-01-13 15:28:31     |
|6.8             |9640580    |231             |2016-01-28 21:47:03     |
|10.3            |8986296    |162             |2016-01-27 06:44:50     |
|20.35           |4689446    |261             |2016-01-14 09:29:27     |
|19.89           |9360850    |166             |2016-01-28 07:33:07     |
|6.3             |9335036    |211             |2016-01-28 04:46:46     |
|72.92           |5223446    |264             |2016-01-15 17:21:27     |
|16.3            |4578405    |116             |2016-01-13 23:35:00     |
|7.3             |5045083    |163             |2016-01-15 07:10:06     |
|10.3            |9733135    |145             |2016-01-29 05:14:06     |
------------------------------------------------------------------------
Generate dataset object from spine dataframe and feature views.
Convert dataset to a snowpark dataframe and examine all the features in it.
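The two steps above (generate the Dataset, then convert it) can be sketched as one helper. The function name is hypothetical; fs is assumed to be a FeatureStore client, spine_df a Snowpark DataFrame, and feature_views the registered feature views. Only keyword arguments of generate_dataset() that this walkthrough relies on are passed.

```python
def build_training_dataframe(fs, spine_df, feature_views, name, version,
                             timestamp_col, label_cols):
    """Materialize a versioned Dataset and expose it as a Snowpark DataFrame."""
    dataset = fs.generate_dataset(
        name=name,
        spine_df=spine_df,
        features=feature_views,
        version=version,
        spine_timestamp_col=timestamp_col,  # drives the point-in-time lookup
        spine_label_cols=label_cols,
    )
    # Dataset.read is the reader object; to_snowpark_dataframe() loads the
    # materialized files back as a Snowpark DataFrame.
    training_df = dataset.read.to_snowpark_dataframe()
    return dataset, training_df
```

Calling training_df.show() on the result would list the label, the timestamp column and every joined feature column.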
Train model with Snowpark ML
Now let's train a simple random forest model, and evaluate the prediction accuracy. When you call fit() on a DataFrame that is created from a Dataset, the linkage between the trained model and dataset is automatically wired up. Later, you can easily retrieve the training dataset from this model, or you can query the lineage about the dataset and model. This is work-in-progress and will be available soon in an upcoming release.
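A minimal training sketch follows. It assumes the feature columns are whatever remains after removing the label, join-key and timestamp columns, which is consistent with the column lists printed in this walkthrough. The function names, the 80/20 split, and the seed are assumptions rather than values taken from the notebook.

```python
def select_feature_columns(all_columns, label_cols, join_keys, timestamp_col):
    """Keep only model inputs: drop labels, join keys and the timestamp."""
    excluded = set(label_cols) | set(join_keys) | {timestamp_col}
    return [c for c in all_columns if c not in excluded]


def train_random_forest(training_df, feature_cols, label_cols):
    """Fit a Snowpark ML random forest on a Dataset-backed DataFrame.

    Requires snowflake-ml-python; the import is deferred so the column
    helper above runs without it.
    """
    from snowflake.ml.modeling.ensemble import RandomForestRegressor

    # Split ratio and seed are illustrative assumptions.
    train_df, test_df = training_df.random_split([0.8, 0.2], seed=42)
    model = RandomForestRegressor(
        input_cols=feature_cols, label_cols=label_cols, random_state=42
    )
    model.fit(train_df)
    predictions = model.predict(test_df)
    return model, predictions
```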
feature cols: ['TRIP_DISTANCE', 'FARE_AMOUNT', 'AVG_FARE_10H', 'PASSENGER_COUNT', 'AVG_FARE_1H']
MSE: 8.587654420611477, Accuracy: 99.83667856616516
Log model into Model Registry.
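Logging can be sketched as below. The function names and the pin_dependencies flag are hypothetical; Registry and log_model() come from snowflake.ml.registry, and the relax_version option is the one named in the warning printed after this step.

```python
def open_registry(session, database, schema):
    """Create a Model Registry client (requires snowflake-ml-python)."""
    from snowflake.ml.registry import Registry

    return Registry(session=session, database_name=database, schema_name=schema)


def log_model_version(registry, model, model_name, version_name,
                      pin_dependencies=False):
    """Log a trained model and return the resulting model version object."""
    kwargs = {"model": model, "model_name": model_name, "version_name": version_name}
    if pin_dependencies:
        # Keep exact ==x.y.z dependency pins instead of the relaxed default.
        kwargs["options"] = {"relax_version": False}
    return registry.log_model(**kwargs)
```

Leaving pin_dependencies at False keeps the library default, which is what triggers the relax_version warning.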
/Users/wezhou/miniconda3/envs/py38/lib/python3.8/contextlib.py:113: UserWarning: `relax_version` is not set and therefore defaulted to True. Dependency version constraints relaxed from ==x.y.z to >=x.y, <(x+1). To use specific dependency versions for compatibility, reproducibility, etc., set `options={'relax_version': False}` when logging the model.
return next(self.gen)
ModelVersion(
  name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
  version='V1',
)
WARNING:snowflake.snowpark:LineageNode.lineage() is in private preview since 1.5.3. Do not use it in production.
WARNING:snowflake.snowpark:Lineage.trace() is in private preview since 1.16.0. Do not use it in production.
[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]
[ModelVersion(
  name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
  version='V1',
)]
[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]
[Dataset(
  name='AIRLINE_FEATURE_STORE.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
  version='4.0',
)]
Finally, we are almost ready for prediction! For this, we look up the latest feature values from Feature Store for the specific data records that we are predicting on. One of the key benefits of Feature Store is that it automatically serves point-in-time correct feature values during prediction. load_feature_views_from_dataset() gets the same feature views used in training, then retrieve_feature_values() looks up the latest feature values.
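The prediction path described above can be sketched as one helper. The function name is hypothetical; fs is assumed to be a FeatureStore client, model_version the object returned when the model was logged, and spine_df the records to score. Excluding the join keys from the enriched frame is an assumption made so that only model inputs and passthrough columns remain.

```python
def predict_with_latest_features(fs, model_version, dataset, spine_df,
                                 join_keys, timestamp_col):
    """Enrich the spine with stored features, then run the logged model."""
    # Recover the exact feature views that produced the training Dataset.
    feature_views = fs.load_feature_views_from_dataset(dataset)
    enriched_df = fs.retrieve_feature_values(
        spine_df=spine_df,
        features=feature_views,
        exclude_columns=join_keys,
        spine_timestamp_col=timestamp_col,
    )
    return model_version.run(enriched_df, function_name="predict")
```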
   TOTAL_AMOUNT TPEP_PICKUP_DATETIME  AVG_FARE_1H  AVG_FARE_10H  \
0         15.96  2016-01-07 10:26:02     9.440415      9.324965
1         10.55  2016-01-01 18:44:40    10.083333      9.236685
2         17.80  2016-01-29 21:05:54    10.385390     10.287410

   PASSENGER_COUNT  TRIP_DISTANCE  FARE_AMOUNT  OUTPUT_TOTAL_AMOUNT
0                1           2.23         12.5            16.440312
1                1           1.70          7.0             8.523669
2                1           3.11         16.5            18.717726

   TRIP_DISTANCE  FARE_AMOUNT  AVG_FARE_10H  PASSENGER_COUNT  AVG_FARE_1H  \
0           2.23         12.5      9.324965                1     9.440415
1           1.70          7.0      9.236685                1    10.083333
2           3.11         16.5     10.287410                1    10.385390

   OUTPUT_TOTAL_AMOUNT
0            16.440312
1             8.523669
2            18.717726
[Row(status='SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO_MODEL successfully dropped.')]