
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

Automated Machine Learning

Classification with Deployment using a Bank Marketing Dataset

Contents

  1. Introduction
  2. Setup
  3. Train
  4. Results
  5. Deploy
  6. Test
  7. Use auto-generated code for retraining
  8. Acknowledgements

Introduction

In this example we use the UCI Bank Marketing dataset to showcase how you can use AutoML for a classification problem and deploy it to an Azure Container Instance (ACI). The classification goal is to predict if the client will subscribe to a term deposit with the bank.

If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the configuration notebook first if you haven't already to establish your connection to the AzureML Workspace.

Please find the ONNX-related documentation here.

In this notebook you will learn how to:

  1. Create an experiment using an existing workspace.
  2. Configure AutoML using AutoMLConfig.
  3. Train the model using local compute with an ONNX-compatible configuration enabled.
  4. Explore the results and featurization transparency options, and save the ONNX model.
  5. Inference with the ONNX model.
  6. Register the model.
  7. Create a container image.
  8. Create an Azure Container Instance (ACI) service.
  9. Test the ACI service.
  10. Leverage the auto-generated training code and use it for retraining on an updated dataset.

In addition, this notebook showcases the following features:

  • Blocking certain pipelines
  • Specifying target metrics to indicate stopping criteria
  • Handling missing data in the input

Setup

As part of the setup you have already created an Azure ML Workspace object. For AutoML you will need to create an Experiment object, which is a named object in a Workspace used to run experiments.

[ ]

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant. Executing the ws = Workspace.from_config() line in the cell below will prompt for authentication the first time that it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the ws = Workspace.from_config() line in the cell below with the following:

	from azureml.core.authentication import InteractiveLoginAuthentication
	auth = InteractiveLoginAuthentication(tenant_id='mytenantid')
	ws = Workspace.from_config(auth=auth)

If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the ws = Workspace.from_config() line in the cell below with the following:

	from azureml.core.authentication import ServicePrincipalAuthentication
	auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
	ws = Workspace.from_config(auth=auth)

For more details, see aka.ms/aml-notebook-auth.

[ ]

Create or Attach existing AmlCompute

You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

Creation of AmlCompute takes approximately 5 minutes.

If the AmlCompute with that name is already in your workspace this code will skip the creation process. As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.

[ ]

Data

Load Data

Leverage Azure compute to load the Bank Marketing dataset as a TabularDataset into the dataset variable.

Training Data

[ ]
[ ]
[ ]

Validation Data

[ ]

Test Data

[ ]

Train

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

Property: Description

  • task: classification, regression, or forecasting.
  • primary_metric: The metric that you want to optimize. Classification supports the following primary metrics: accuracy, AUC_weighted, average_precision_score_weighted, norm_macro_recall, precision_score_weighted.
  • iteration_timeout_minutes: Time limit in minutes for each iteration.
  • blocked_models: List of strings indicating machine learning algorithms for AutoML to avoid in this run.
    Allowed values for classification: LogisticRegression, SGD, MultinomialNaiveBayes, BernoulliNaiveBayes, SVM, LinearSVM, KNN, DecisionTree, RandomForest, ExtremeRandomTrees, LightGBM, GradientBoosting, TensorFlowDNN, TensorFlowLinearClassifier.
    Allowed values for regression: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTrees, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN.
    Allowed values for forecasting: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTrees, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN, Arima, Prophet.
  • allowed_models: List of strings indicating machine learning algorithms for AutoML to use in this run. The same values listed above for blocked_models are allowed for allowed_models.
  • experiment_exit_score: Value indicating the target for primary_metric. Once the target is surpassed, the run terminates.
  • experiment_timeout_hours: Maximum amount of time in hours that all iterations combined can take before the experiment terminates.
  • enable_early_stopping: Flag to enable early termination if the score is not improving in the short term.
  • featurization: 'auto' / 'off'. Indicator for whether the featurization step should be done automatically or not. Note: if the input data is sparse, featurization cannot be turned on.
  • n_cross_validations: Number of cross-validation splits.
  • training_data: Input dataset, containing both features and the label column.
  • label_column_name: The name of the label column.
  • enable_code_generation: Flag to enable generation of training code for each of the models that AutoML creates.

You can find more information about primary metrics here.

[ ]

Call the submit method on the experiment object and pass the run configuration. Depending on the data and the number of iterations, this can run for a while. When show_output=True, execution is synchronous, and validation errors and current status are shown as the run progresses.

[ ]

Run the following cell to access previous runs. Uncomment the cell below and update the run_id.

[ ]
[ ]
[ ]

Transparency

View featurization summary for the best model - to study how different features were transformed. This is stored as a JSON file in the outputs directory for the run.

[ ]

Results

[ ]

Retrieve the Best Model's explanation

Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed.

[ ]

Download engineered feature importance from artifact store

You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run.

[ ]

Download raw feature importance from artifact store

You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run.

[ ]

Retrieve the Best ONNX Model

Below we select the best pipeline from our iterations. The get_output method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on get_output allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

Set the parameter return_onnx_model=True to retrieve the best ONNX model, instead of the Python model.

[ ]

Save the best ONNX model

[ ]

Predict with the ONNX model, using onnxruntime package

[ ]

Deploy

Retrieve the Best Model

Below we select the best pipeline from our iterations. The get_best_child method returns the Run object for the best model based on the default primary metric. There are additional flags that can be passed to the method if we want to retrieve the best Run based on any of the other supported metrics, or if we are just interested in the best run among the ONNX compatible runs. As always, you can execute ??remote_run.get_best_child in a new cell to view the source or docs for the function.

[ ]

Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

Note: The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

[ ]
[ ]

Register the Fitted Model for Deployment

If neither metric nor iteration are specified in the register_model call, the iteration with the best primary metric is registered.

[ ]

Deploy the model as a Web Service on Azure Container Instance

[ ]

Get Logs from a Deployed Web Service

Gets logs from a deployed web service.

[ ]

Test

Now that the model is trained, run the test data through the trained model to get the predicted values. This calls the ACI web service to do the prediction.

Note that the JSON passed to the ACI web service is an array of rows of data. Each row should either be an array of values in the same order that was used for training or a dictionary where the keys are the same as the column names used for training. The example below uses dictionary rows.

[ ]
[ ]
[ ]
[ ]
[ ]
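The dictionary-row payload described above can be sketched in plain Python. The column names and values here are illustrative (a few fields from the UCI bank marketing schema), and the {"data": [...]} envelope is an assumption about what the generated scoring script expects.

```python
import json

# Build a request body of dictionary rows keyed by the training column
# names. Columns/values are illustrative; the {"data": [...]} envelope is
# an assumption about the auto-generated scoring script.
rows = [
    {"age": 35, "job": "technician", "marital": "married", "education": "university.degree"},
    {"age": 58, "job": "retired", "marital": "single", "education": "basic.4y"},
]
payload = json.dumps({"data": rows})
print(payload[:40])
```

The payload would then be passed to the service, e.g. with aci_service.run(input_data=payload).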

Calculate metrics for the prediction

Now visualize the data as a confusion matrix that compares the predicted values against the actual values.

[ ]
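The tallying behind a confusion matrix can be sketched without any plotting library; the labels below are illustrative values, not model output:

```python
# Tally a 2x2 confusion matrix from actual vs. predicted labels using
# only the standard library (values are illustrative, not model output).
from collections import Counter

actual    = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no"]

counts = Counter(zip(actual, predicted))
labels = ["no", "yes"]
# rows = actual class, columns = predicted class
matrix = [[counts[(a, p)] for p in labels] for a in labels]
print(matrix)  # [[2, 1], [0, 2]]
```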

Delete a Web Service

Deletes the specified web service.

[ ]

Using the auto-generated model training code for retraining on new data

Because we enabled code generation when the original experiment was created, we now have access to the code that AutoML used to train any of the models it tried. Below we'll use the generated training script of the best model to retrain on a new dataset.

For this demo, we'll begin by creating a new retraining dataset by combining the Train and Validation datasets that were used in the original experiment.

[ ]
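A sketch of the combination step with pandas. In the notebook the splits are TabularDatasets, so each would first be converted with to_pandas_dataframe(); the tiny frames here are illustrative stand-ins for the real splits.

```python
# Combine the train and validation splits into one retraining frame.
# These tiny frames are illustrative stand-ins for the real splits.
import pandas as pd

train_df = pd.DataFrame({"age": [30, 41], "y": ["no", "yes"]})
valid_df = pd.DataFrame({"age": [52], "y": ["no"]})

retrain_df = pd.concat([train_df, valid_df], ignore_index=True)
print(len(retrain_df))  # 3
```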

Next, we'll download the generated script for the best run and use it for retraining. For more advanced scenarios, you can customize the training script as you need: change the featurization pipeline, change the learner algorithm or its hyperparameters, etc.

For this exercise, we'll leave the script as it was generated.

[ ]
[ ]

After the run completes, we can download, test, or deploy the model it has built.

[ ]

Acknowledgements

This Bank Marketing dataset is made available under the Creative Commons (CC0: Public Domain) License: https://creativecommons.org/publicdomain/zero/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: https://creativecommons.org/publicdomain/zero/1.0/. The dataset is available at: https://www.kaggle.com/janiobachmann/bank-marketing-dataset.

This dataset is originally available in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bank+marketing

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014