AutoML Forecasting Pipelines
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
!Important!
This notebook is outdated and is not supported by the AutoML Team. Please use the supported version (link).
For examples illustrating how to build pipelines with components, please use the following links:
Training and Inferencing AutoML Forecasting Model Using Pipelines
Introduction
In this notebook, we demonstrate how to use pipelines to train an AutoML forecasting model and run inference with it. Two pipelines will be created: one for training the AutoML model and one for running inference with it. We'll also demonstrate how to schedule the inference pipeline so that you get inference results periodically (with a refreshed test dataset). Make sure you have executed the configuration notebook before running this notebook. In this notebook you will learn how to:
- Configure AutoML using AutoMLConfig for forecasting tasks using pipeline AutoMLSteps.
- Create and register an AutoML model using an AzureML pipeline.
- Run inference and schedule the pipeline using the registered model.
Setup
As part of the setup you have already created an Azure ML Workspace object. For AutoML you will need to create an Experiment object, which is a named object in a Workspace used to run experiments.
This sample notebook may use features that are not available in previous versions of the Azure ML SDK.
Accessing the Azure ML workspace requires authentication with Azure.
The default authentication is interactive authentication using the default tenant. Executing the ws = Workspace.from_config() line in the cell below will prompt for authentication the first time that it is run.
If you have multiple Azure tenants, you can specify the tenant by replacing the ws = Workspace.from_config() line in the cell below with the following:
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the ws = Workspace.from_config() line in the cell below with the following:
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
For more details, see aka.ms/aml-notebook-auth
Compute
Create or Attach existing AmlCompute
You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
Creation of AmlCompute takes approximately 5 minutes.
If an AmlCompute with that name already exists in your workspace, this code will skip the creation process. As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.
Data
You are now ready to load the historical orange juice sales data. For demonstration purposes, we extract sales time-series for just a few of the stores. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called WeekStarting, so it will be specially parsed into the datetime type.
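As a minimal sketch of the loading step, the snippet below parses the time column while reading a CSV and keeps a subset of stores. The rows are made up for illustration, the `Price` and `Advert` column names and the `stores_to_keep` selection are assumptions, and an in-memory string stands in for the actual data file.

```python
import io

import pandas as pd

# Made-up rows that only mimic the layout described above; the notebook
# itself reads the orange juice sales CSV file instead.
csv_text = """WeekStarting,Store,Brand,Quantity,Price,Advert
1990-06-14,2,dominicks,10560,1.59,1
1990-06-21,2,dominicks,9792,1.69,0
1990-06-14,5,tropicana,8256,2.09,0
1990-06-21,5,tropicana,6144,2.19,0
"""

time_column_name = "WeekStarting"

# parse_dates converts the time column to a datetime type while reading.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=[time_column_name])

# Keep only a few stores for demonstration (hypothetical selection).
stores_to_keep = [2]
data_subset = data[data["Store"].isin(stores_to_keep)]

print(data_subset.dtypes)
```

Parsing the dates at load time means later steps, such as sorting and splitting by time, operate on real timestamps rather than strings.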
Each row in the DataFrame holds a quantity of weekly sales for an OJ brand at a single store. The data also includes the sales price, a flag indicating if the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the data also include the logarithm of the sales quantity. The Dominick's grocery data is commonly used to illustrate econometric modeling techniques where logarithms of quantities are generally preferred.
The task is now to build a time-series model for the Quantity column. It is important to note that this dataset is comprised of many individual time-series - one for each unique combination of Store and Brand. To distinguish the individual time-series, we define the time_series_id_column_names - the columns whose values determine the boundaries between time-series:
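A small sketch of this definition, using a toy DataFrame with made-up values in place of the real sales data, could look like the following. It also counts the series and checks that each one has at most one row per timestamp.

```python
import pandas as pd

# Toy data with made-up values: three Store/Brand combinations, two weeks each.
data = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(["1990-06-14", "1990-06-21"] * 3),
        "Store": [2, 2, 2, 2, 5, 5],
        "Brand": ["dominicks", "dominicks", "tropicana", "tropicana",
                  "dominicks", "dominicks"],
        "Quantity": [10560, 9792, 8256, 6144, 1792, 2048],
    }
)

time_column_name = "WeekStarting"
time_series_id_column_names = ["Store", "Brand"]

# Each unique combination of the id columns is one time series.
nseries = data.groupby(time_series_id_column_names).ngroups
print(f"Data contains {nseries} individual time-series.")

# Within a series, a timestamp should appear at most once.
has_duplicates = data.duplicated(
    subset=time_series_id_column_names + [time_column_name]
).any()
print(f"Duplicate timestamps within a series: {has_duplicates}")
```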
Test Splitting
We now split the data into a training and a testing set for later forecast prediction. The test set will contain the final 4 weeks of observed sales for each time-series. The splits should be stratified by series, so we use a group-by statement on the time series identifier columns.
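One way to write such a split is sketched below on a toy DataFrame with made-up values. The helper name `split_last_n_by_series_id` is chosen here for illustration; it takes the last `n` rows of every series as the test set and leaves the rest for training.

```python
import pandas as pd

time_column_name = "WeekStarting"
time_series_id_column_names = ["Store", "Brand"]
n_test_periods = 4  # final 4 weeks of each series go to the test set

# Toy data with made-up values: two series with six weekly rows each.
weeks = pd.date_range("1990-06-14", periods=6, freq="7D")
frames = []
for store, brand in [(2, "dominicks"), (2, "tropicana")]:
    frames.append(
        pd.DataFrame(
            {
                time_column_name: weeks,
                "Store": store,
                "Brand": brand,
                "Quantity": list(range(100, 106)),
            }
        )
    )
data = pd.concat(frames, ignore_index=True)


def split_last_n_by_series_id(df, n):
    """Split df so that the last n periods of every series form the test set."""
    df_sorted = df.sort_values(time_series_id_column_names + [time_column_name])
    # tail(n) on the grouped frame keeps the last n rows of each series.
    test = df_sorted.groupby(time_series_id_column_names).tail(n)
    # Everything that is not in the test set is training data.
    train = df_sorted.drop(test.index)
    return train, test


train, test = split_last_n_by_series_id(data, n_test_periods)
print(len(train), len(test))
```

Grouping before taking the tail keeps the split aligned per series, so a series that ends earlier than the others still contributes its own final weeks to the test set.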
Upload data to datastore
The Machine Learning service workspace is paired with a storage account, which contains the default datastore. We will use it to upload the train and test data and create tabular datasets for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.
Training
Modeling
For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:
- Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span
- Impute missing values in the target (via forward-fill) and feature columns (using median column values)
- Create features based on time series identifiers to enable fixed effects across different series
- Create time-based features to assist in learning seasonal patterns
- Encode categorical variables to numeric quantities
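To make the first two steps more concrete, here is a rough pandas sketch on a single made-up weekly series. It only illustrates the ideas of regularizing the time index and imputing gaps; it is not AutoML's actual implementation, and the frequency heuristic (the most common gap between timestamps) is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Made-up weekly series: the week of 1990-06-28 is absent and one price is missing.
series = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(
            ["1990-06-14", "1990-06-21", "1990-07-05", "1990-07-12"]
        ),
        "Quantity": [100.0, 120.0, 90.0, 110.0],
        "Price": [1.5, np.nan, 1.7, 1.9],
    }
).set_index("WeekStarting")

# Guess the sampling step as the most common gap between timestamps.
gaps = series.index.to_series().diff().dropna()
step = gaps.value_counts().idxmax()

# Create rows for absent time points so the series becomes regular.
full_index = pd.date_range(series.index.min(), series.index.max(), freq=step)
regular = series.reindex(full_index)

# Impute the target with forward-fill and the feature with its median.
regular["Quantity"] = regular["Quantity"].ffill()
regular["Price"] = regular["Price"].fillna(regular["Price"].median())

# A simple time-based feature that can help with seasonal patterns.
regular["month"] = regular.index.month

print(regular)
```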
In this notebook, AutoML will train a single, regression-type model across all time-series in a given training set. This allows the model to generalize across related series. If you want to train multiple models for different time-series, please see the many-models notebook.
You are almost ready to start an AutoML training job. First, we need to define the target column.
Forecasting Parameters
To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameters we will be passing into our experiment.
| Property | Description |
|---|---|
| time_column_name | The name of your time column. |
| forecast_horizon | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). |
| time_series_id_column_names | The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series. |
| freq | Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information. |
| cv_step_size | Number of periods between two consecutive cross-validation folds. The default value is "auto", in which case AutoML determines the cross-validation step size automatically if a validation set is not provided. Alternatively, users can specify an integer value. |
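To illustrate how forecast_horizon and cv_step_size interact, the sketch below lays out rolling-origin cross-validation folds over a series of integer positions. This is one common way to arrange such folds and is not necessarily how AutoML builds them internally; the function name and the `n_folds` argument are hypothetical.

```python
def rolling_origin_folds(n_periods, forecast_horizon, n_folds, cv_step_size):
    """Return (train_range, valid_range) pairs, oldest fold first.

    Positions are 0-based indices into a series of length n_periods.
    """
    folds = []
    for k in range(n_folds):
        # Folds are counted back from the end of the series,
        # each shifted by cv_step_size periods.
        valid_end = n_periods - k * cv_step_size
        valid_start = valid_end - forecast_horizon
        if valid_start <= 0:
            raise ValueError("Series is too short for the requested folds.")
        # Train on everything before the validation window.
        folds.append((range(0, valid_start), range(valid_start, valid_end)))
    return folds[::-1]


folds = rolling_origin_folds(
    n_periods=20, forecast_horizon=4, n_folds=3, cv_step_size=2
)
for train_range, valid_range in folds:
    print(f"train: 0..{train_range[-1]}  validate: {valid_range[0]}..{valid_range[-1]}")
```

With a step size smaller than the horizon, as here, consecutive validation windows overlap; a step size equal to the horizon gives non-overlapping windows.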
Register Model Step
Run Configuration and Environment
To run a pipeline step, we first need an environment in which to run the jobs. The environment can be built using the following code.
Step to register the model.
The following code generates a step that registers the model from the previous step to the workspace.
Build the Pipeline
Submit Pipeline Run
Get metrics for each run
Inference
There are several ways to do inference; here we demonstrate how to use the registered model and a pipeline. (For how to register a model, see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py).
Get Inference Pipeline Environment
To trigger an inference pipeline run, we first need a run environment that contains all the packages required to unpickle the model. This environment can either be accessed from the training run or built from the yml file that comes with the model.
Once we have the environment for inference, we can build a run configuration based on it.
Build and submit the inference pipeline
The inference pipeline creates outputs in two formats: 1) a tabular dataset that contains the predictions and 2) an OutputFileDatasetConfig that can be consumed by subsequent pipeline steps.
Get the predicted data
Schedule Pipeline
This section shows how to schedule a pipeline for periodic predictions. For more information about pipeline schedules and pipeline endpoints, please follow this notebook.
If test_dataset is refreshed every 4 weeks before Friday 16:00 and we want to predict every 4 weeks (the forecast_horizon), we can schedule the pipeline to run every 4 weeks on Friday at 16:00 to get fresh inference results. You can refresh your test dataset (a newer version will be created) periodically when new data is available: the target column in the test dataset holds observed values at the beginning, which serve as context data, followed by NaNs for the periods to be predicted. The inference pipeline picks up this context to further improve forecast accuracy.
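The timing and the shape of a refreshed test dataset can be sketched without the Azure ML SDK as follows. The start date, the quantities, and the variable names are made up for illustration; the snippet only shows the 4-week cadence and the layout of context rows followed by NaN rows, not the scheduling API itself.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

forecast_horizon = 4  # weeks, matching the schedule described above

# Hypothetical first run on a Friday at 16:00, then every 4 weeks.
first_run = datetime(2024, 1, 5, 16, 0)
run_times = [
    first_run + timedelta(weeks=forecast_horizon * i) for i in range(3)
]
for run_time in run_times:
    print(run_time.strftime("%A %Y-%m-%d %H:%M"))

# A refreshed test set for one series: observed target values first
# (context), followed by NaNs for the periods to be predicted.
weeks = pd.date_range("2024-01-04", periods=8, freq="7D")
test_df = pd.DataFrame(
    {
        "WeekStarting": weeks,
        "Quantity": [100.0, 110.0, 95.0, 105.0] + [np.nan] * forecast_horizon,
    }
)

context = test_df[test_df["Quantity"].notna()]
to_predict = test_df[test_df["Quantity"].isna()]
print(f"{len(context)} context rows, {len(to_predict)} rows to predict")
```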