Creating an Amazon Forecast Predictor with SageMaker Pipelines
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
This example notebook showcases how you can create a dataset, dataset group, and predictor with Amazon Forecast and SageMaker Pipelines. This demo is designed to run on SageMaker Notebook Instances. As of February 2022, this code will not execute properly in SageMaker Studio due to a Docker limitation in SageMaker Studio notebooks.
Integrating SageMaker Pipelines with Amazon Forecast is useful for the following three reasons:
- Iteratively improve your model by tracking the performance of each execution using SageMaker Experiments.
- Reproducibility of Forecast experiments.
- Decouple different processes in your Amazon Forecast machine learning project and visualize these in a Directed Acyclic Graph using SageMaker Pipelines.
This notebook can be used as a template to start training your own Forecast predictors with SageMaker Pipelines. Before you start, make sure that your SageMaker Execution Role has the following policies:
- AmazonForecastFullAccess
- AmazonSageMakerFullAccess
Your SageMaker Execution Role should have access to S3 already. If not, you can add an S3 policy. You will also need to add the inline policy described below.
Finally, you will need the following trust policies.
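The exact policy documents follow in the next cells. Purely as an illustrative sketch of how such policies could be attached with boto3 (the role name, policy name, and policy contents here are assumptions, not the notebook's definitive policies):

```python
import json

import boto3

iam = boto3.client("iam")
role_name = "MySageMakerExecutionRole"  # hypothetical role name

# Illustrative inline policy (assumption): allow the role to be passed to Amazon Forecast.
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"}],
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="ForecastPassRole",  # hypothetical policy name
    PolicyDocument=json.dumps(inline_policy),
)

# Illustrative trust policy (assumption): let both SageMaker and Forecast assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "forecast.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))
```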
Prerequisites
First, we are going to import the SageMaker SDK and set some default variables such as the role for permissioned execution and the default_bucket to store model artifacts.
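A typical setup cell looks like the following (the variable names are conventional choices, not mandated by the SDK):

```python
import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()      # the SageMaker Execution Role discussed above
default_bucket = session.default_bucket()  # default S3 bucket for artifacts
region = session.boto_region_name
```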
Then, we have to extend the base Scikit-learn SageMaker image to upgrade boto3 and botocore. As of February 2022, the Scikit-learn image ships an older version of botocore (1.19.4) that does not yet contain the API calls you need to make to Amazon Forecast.
The script below creates an ECR repository with the given repo_name within your AWS account, in the region you are running this notebook from. It then pulls the prebuilt Amazon SageMaker Docker image for Scikit-learn as its base image. The notebook automatically selects the correct image_acc_id for your region using the region_to_account_id dictionary, according to https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html.
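As a sketch of what that lookup and the resulting image URIs look like (the image tag and repository name below are assumptions; the account IDs are from the documentation linked above):

```python
import boto3

# Excerpt of the region -> account mapping for the prebuilt Scikit-learn images
# (see the linked AWS documentation for the full table).
region_to_account_id = {
    "us-east-1": "683313688378",
    "us-west-2": "246618743249",
    # ... remaining regions elided
}
image_acc_id = region_to_account_id[region]

# Base image to extend; the tag (framework/Python version) is an assumption.
base_image_uri = f"{image_acc_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"

# After building and pushing, the custom image lives in your own account.
account_id = boto3.client("sts").get_caller_identity()["Account"]
repo_name = "forecast-sklearn"  # assumed repository name
custom_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
```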
Dataset
Let's inspect the train dataset we will be using in this example.
The dataset spans January 01, 2011 to January 01, 2015. We are only going to use about two and a half weeks' worth of hourly data to train Amazon Forecast. We will copy the dataset from this local directory to S3 so that SageMaker can access it.
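A minimal sketch of inspecting and uploading the data, assuming the file lives at data/train.csv locally:

```python
import pandas as pd

# Peek at the training data (local path assumed).
train_df = pd.read_csv("data/train.csv")
print(train_df.head())

# Copy it to S3 so the pipeline's processing job can read it.
train_s3_uri = session.upload_data(
    path="data/train.csv",
    bucket=default_bucket,
    key_prefix="forecast-pipeline/train",
)
```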
Next, we define parameters that can be set for the execution of the pipeline. They serve as variables. We define the following:
- ProcessingInstanceCount: The number of processing instances to use for the execution of the pipeline
- ProcessingInstanceType: The type of processing instances to use for the execution of the pipeline
- TrainingInstanceCount: The number of training instances to use for the execution of the pipeline
- TrainingInstanceType: The type of training instances to use for the execution of the pipeline
- TrainData: Location of the training data in S3
- ModelOutput: Location of the target S3 path for the Amazon Forecast model artifact
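A sketch of how these parameters might be declared (the default values shown are assumptions):

```python
from sagemaker.workflow.parameters import ParameterInteger, ParameterString

processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
processing_instance_type = ParameterString(name="ProcessingInstanceType", default_value="ml.m5.xlarge")
training_instance_count = ParameterInteger(name="TrainingInstanceCount", default_value=1)
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
train_data = ParameterString(name="TrainData", default_value=train_s3_uri)
model_output = ParameterString(name="ModelOutput", default_value=f"s3://{default_bucket}/forecast-pipeline/model")
```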
Amazon Forecast creates its own validation set when training, so there is no need to provide one.
We also define some important parameters to choose, train, and evaluate the model:
- ForecastHorizon: The forecast horizon (prediction length)
- ForecastAlgorithm: Which Amazon Forecast algorithm to use (e.g., Deep_AR_Plus, CNN-QR, ...)
- EvaluationMetric: The evaluation metric used to select (keep) the model
- MaxScore: The evaluation metric's threshold to select (keep) the model
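These can be declared the same way; the defaults below are placeholder assumptions:

```python
forecast_horizon = ParameterString(name="ForecastHorizon", default_value="24")
forecast_algorithm = ParameterString(name="ForecastAlgorithm", default_value="Deep_AR_Plus")
evaluation_metric = ParameterString(name="EvaluationMetric", default_value="WAPE")
max_score = ParameterString(name="MaxScore", default_value="0.4")
```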
We use an updated SKLearnProcessor to run Python scripts to build a dataset group and train an Amazon Forecast predictor using boto3. In the next chunk, we instantiate a ScriptProcessor, essentially an SKLearnProcessor with updated boto3 and botocore (from the image built above), which we use in the following steps.
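A sketch of that instantiation, assuming custom_image_uri points at the image pushed to ECR above:

```python
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    image_uri=custom_image_uri,  # custom image with upgraded boto3/botocore
    command=["python3"],
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
)
```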
First, we preprocess the data using an Amazon SageMaker ProcessingStep, which provides a containerized execution environment to run the preprocess.py script.
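A minimal sketch of such a step (the step name and container paths are assumptions):

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

step_preprocess = ProcessingStep(
    name="PreprocessForecastData",  # step name assumed
    processor=script_processor,
    inputs=[ProcessingInput(source=train_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train")],
    code="preprocess.py",
)
```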
The next step is to train and evaluate the forecasting model by calling Amazon Forecast with boto3. We instantiate an SKLearn estimator that we use in the next TrainingStep to run the script train.py.
Amazon Forecast automatically evaluates the performance on an evaluation set. We will use that score as a condition for deploying the model.
The algorithm training is managed by Amazon Forecast. We use a TrainingStep instead of a ProcessingStep to log the metrics with SageMaker Experiments.
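A sketch of the estimator and training step, assuming hypothetical hyperparameter names that train.py would read:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import TrainingStep

forecast_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    image_uri=custom_image_uri,  # custom image with upgraded boto3/botocore
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    # Hypothetical hyperparameter names; train.py would read these to drive Forecast.
    hyperparameters={
        "forecast_horizon": forecast_horizon,
        "forecast_algorithm": forecast_algorithm,
        "region": region,
    },
)

step_train = TrainingStep(
    name="TrainForecastModel",  # step name assumed
    estimator=forecast_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        )
    },
)
```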
The third step is an Amazon SageMaker ProcessingStep that runs the script conditional_delete.py to either keep or delete the trained Amazon Forecast model. If the error reported after training is higher than the threshold you specify for your chosen metric, this step deletes all the resources created by Amazon Forecast that are related to the pipeline's execution.
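A sketch of this step, with hypothetical script argument names:

```python
step_conditional_delete = ProcessingStep(
    name="ConditionalDeleteForecastResources",  # step name assumed
    processor=script_processor,
    # Hypothetical argument names; conditional_delete.py would compare the
    # reported error for EvaluationMetric against MaxScore and tear down the
    # Forecast resources when the threshold is exceeded.
    job_arguments=[
        "--evaluation-metric", evaluation_metric,
        "--max-score", max_score,
    ],
    code="conditional_delete.py",
    depends_on=[step_train],
)
```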
Finally, we combine all the steps and define our pipeline.
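A sketch of the pipeline definition, collecting the parameters and steps declared above:

```python
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="ForecastPipeline",
    parameters=[
        processing_instance_count, processing_instance_type,
        training_instance_count, training_instance_type,
        train_data, model_output,
        forecast_horizon, forecast_algorithm,
        evaluation_metric, max_score,
    ],
    steps=[step_preprocess, step_train, step_conditional_delete],
)

# Create the pipeline, or update it if it already exists.
pipeline.upsert(role_arn=role)
```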
Once the pipeline is successfully defined, we can start the execution.
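For example:

```python
execution = pipeline.start()
execution.wait()        # block until the execution completes
execution.list_steps()  # inspect the status of each step
```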
Experiments Tracking
Each pipeline execution is tracked by default when using SageMaker Pipelines. To find the experiment tracking in SageMaker Studio, you should open SageMaker Resources and select Experiments and Trials. The experiments and trials are organized as follows:
- The Pipeline ForecastPipeline is associated with an Experiment.
- Each execution of Pipeline ForecastPipeline is associated with a trial.
- Each step within the execution is associated with a trial component within the trial.
To find the Trial Components and Trial name generated by a pipeline execution in ForecastPipeline, you:
- Open SageMaker Resources and select Experiments and Trials.
- Right-click on your Pipeline’s name (ForecastPipeline) and select Open in trial component list.
- You can now filter the trial components and customize the table view as presented in View and Compare Amazon SageMaker Experiments, Trials, and Trial Components.
The AWS documentation for Experiments Tracking can be found under Manage Machine Learning with Amazon SageMaker Experiments.
Conclusion
In this notebook, we have seen how to create a SageMaker Pipeline that trains an Amazon Forecast predictor on your own dataset with a target and related time series.
Clean up
Feel free to clean up all related resources that could potentially incur costs: the pipeline, the S3 object (train.csv), and all Forecast-related resources.
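A sketch of the teardown, assuming the names and keys used earlier in this notebook:

```python
import boto3

# Delete the pipeline itself.
boto3.client("sagemaker").delete_pipeline(PipelineName="ForecastPipeline")

# Remove the uploaded training data (key assumed from the upload above).
boto3.resource("s3").Object(default_bucket, "forecast-pipeline/train/train.csv").delete()

# Forecast resources must be deleted in dependency order with the forecast
# client, e.g. (ARNs depend on what the pipeline created):
# forecast = boto3.client("forecast")
# forecast.delete_predictor(PredictorArn=...)
# forecast.delete_dataset(DatasetArn=...)
# forecast.delete_dataset_group(DatasetGroupArn=...)
```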
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.