Aml Pipelines Parameter Tuning With Hyperdrive
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Azure Machine Learning Pipeline with HyperDriveStep
This notebook demonstrates how to use HyperDriveStep in an Azure Machine Learning (AML) pipeline.
Prerequisites and Azure Machine Learning Basics
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook first if you haven't already. It sets you up with a working config file that contains information about your workspace, subscription ID, and so on.
Azure Machine Learning and Pipeline SDK-specific imports
Initialize workspace
Initialize a workspace object from the persisted configuration. If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure the config file is present at .\config.json.
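A minimal sketch of this step, assuming the v1 `azureml-core` SDK is installed. The helper name `load_workspace` is hypothetical; `Workspace.from_config()` looks for `config.json` in the current directory and its parents.

```python
def load_workspace():
    """Load the workspace described by the persisted config.json file."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core import Workspace

    ws = Workspace.from_config()
    # Print a few identifying attributes as a sanity check.
    print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
    return ws
```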
Create an Azure ML experiment
Let's create an experiment named "tf-mnist" and a folder to hold the training scripts.
The best practice is to use a separate folder for each step's scripts and their dependent files. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since a change to any file in the source_directory triggers a re-upload of the snapshot, this also preserves step reuse when nothing in the step's source_directory has changed.
The script runs will be recorded under the experiment in Azure.
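The two actions above could be sketched as follows. The folder path `./tf-mnist` and both helper names are assumptions made for illustration; only the experiment name "tf-mnist" comes from the text.

```python
import os


def make_script_folder(path="./tf-mnist"):
    """Create the folder that will hold the training scripts (path is an assumption)."""
    os.makedirs(path, exist_ok=True)
    return path


def create_experiment(ws, name="tf-mnist"):
    """Create (or fetch) the experiment under which script runs are recorded."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core import Experiment

    return Experiment(workspace=ws, name=name)
```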
Download MNIST dataset
To train on the MNIST dataset, we first need to download it directly from Yann LeCun's web site and save it in a local data folder.
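A hedged sketch of the download step. The four file names are the standard MNIST archive names; the `base_url` argument, the destination folder, and the helper name `download_files` are assumptions, so pass whichever mirror you actually use.

```python
import os
import urllib.request

# Standard file names of the four MNIST archives.
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def download_files(base_url, filenames=MNIST_FILES, dest_dir="./data/mnist"):
    """Download each file from base_url into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name in filenames:
        target = os.path.join(dest_dir, name)
        urllib.request.urlretrieve(base_url.rstrip("/") + "/" + name, filename=target)
        paths.append(target)
    return paths
```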
Show some sample images
Let's load the downloaded compressed files into numpy arrays using some utility functions included in the utils.py library file from the current folder. Then we use matplotlib to plot 30 random images from the dataset along with their labels.
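The contents of utils.py are not shown here, so the following is only a sketch of how a gzip-compressed MNIST (IDX) file can be read into a numpy array. The function name `load_idx` is hypothetical. It relies on the IDX layout: a 4-byte magic number, a big-endian item count, and (for image files) big-endian row and column counts, followed by one unsigned byte per value.

```python
import gzip
import struct

import numpy as np


def load_idx(filename, label=False):
    """Read a gzip-compressed IDX file into a 2-D uint8 array."""
    with gzip.open(filename, "rb") as gz:
        gz.read(4)  # skip the magic number
        (n_items,) = struct.unpack(">I", gz.read(4))
        if label:
            # Label files hold one byte per item.
            data = np.frombuffer(gz.read(n_items), dtype=np.uint8)
            return data.reshape(n_items, 1)
        # Image files store the image dimensions next.
        n_rows, n_cols = struct.unpack(">II", gz.read(8))
        data = np.frombuffer(gz.read(n_items * n_rows * n_cols), dtype=np.uint8)
        # One flattened image per row.
        return data.reshape(n_items, n_rows * n_cols)
```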
Upload MNIST dataset to blob datastore
A datastore is a place where data can be stored and then made accessible to a Run, either by mounting it or by copying the data to the compute target. In the next step, we will upload the training and test sets to the Azure Blob datastore, which we will later mount on the compute cluster for training.
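A minimal sketch of the upload, assuming `ws` is an `azureml.core.Workspace` whose default datastore is a blob datastore. The local folder, the target path, and the helper name are assumptions.

```python
def upload_mnist(ws, src_dir="./data/mnist", target_path="mnist"):
    """Upload the local MNIST folder to the workspace's default datastore."""
    datastore = ws.get_default_datastore()
    # overwrite=True replaces files that already exist at the target path.
    datastore.upload(
        src_dir=src_dir,
        target_path=target_path,
        overwrite=True,
        show_progress=True,
    )
    return datastore
```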
Create Azure Machine Learning datasets
By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.
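One way to create such a reference is a file dataset that points at a folder on the datastore. This is a sketch under the assumption that the data was uploaded to a folder named `mnist`; the helper name is hypothetical.

```python
def create_mnist_dataset(datastore, path_on_datastore="mnist"):
    """Create a FileDataset that references files already on the datastore."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core import Dataset

    # Only a reference is created; no data is copied.
    return Dataset.File.from_files(path=[(datastore, path_on_datastore)])
```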
Retrieve or create an Azure Machine Learning compute
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
If we could not find the compute with the given name in the previous cell, then we will create a new compute here. This process is broken down into the following steps:
- Create the configuration
- Create the Azure Machine Learning compute
This process takes a few minutes and provides only sparse output along the way. Please wait until the call returns before moving to the next cell.
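The retrieve-or-create flow described above might look like the following sketch. The cluster name and VM size are assumptions; `max_nodes=4` mirrors the four-node cluster mentioned later in this notebook.

```python
def get_or_create_compute(ws, cluster_name="gpu-cluster",
                          vm_size="STANDARD_NC6", max_nodes=4):
    """Return the named compute target, creating it if it does not exist."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        target = ComputeTarget(workspace=ws, name=cluster_name)
        print("Found existing compute target:", cluster_name)
    except ComputeTargetException:
        # Step 1: create the configuration.
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size, max_nodes=max_nodes
        )
        # Step 2: create the compute and block until provisioning finishes.
        target = ComputeTarget.create(ws, cluster_name, config)
        target.wait_for_completion(show_output=True, timeout_in_minutes=20)
    return target
```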
Copy the training files into the script folder
The TensorFlow training script has already been created for you. You can simply copy it into the script folder, together with the utility library used to load the compressed data files into numpy arrays.
Retrieve an Environment
In this tutorial, we will use one of Azure ML's curated TensorFlow environments for training. Curated environments are available in your workspace by default. Specifically, we will use the TensorFlow 2.0 GPU curated environment.
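A sketch of retrieving a curated environment by name. The name `AzureML-TensorFlow-2.0-GPU` is an assumption about how the TensorFlow 2.0 GPU environment is registered; curated environment names change over time, so check the list available in your workspace.

```python
def get_tf_environment(ws, name="AzureML-TensorFlow-2.0-GPU"):
    """Fetch a curated environment from the workspace (name is an assumption)."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core import Environment

    return Environment.get(workspace=ws, name=name)
```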
Set up an input for the ScriptRunConfig step
You can mount a dataset on the remote compute target.
Configure the training job
Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, the environment to use, and the compute target to run on.
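A sketch of the job configuration, with the dataset mounted as the script's input. The script name `tf_mnist.py` and the `--data-folder` argument are assumptions about the training script's interface, and the helper name is hypothetical.

```python
def build_run_config(dataset, script_folder, compute_target, environment,
                     script="tf_mnist.py"):
    """Build a ScriptRunConfig that mounts the dataset on the compute target."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.core import ScriptRunConfig

    # as_mount() makes the files available at a path on the remote compute.
    data_folder = dataset.as_mount()
    return ScriptRunConfig(
        source_directory=script_folder,
        script=script,
        arguments=["--data-folder", data_folder],
        compute_target=compute_target,
        environment=environment,
    )
```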
Intelligent hyperparameter tuning
Now let's try hyperparameter tuning by launching multiple runs on the cluster. First, let's define the parameter space.
In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (validation_acc).
Now we will define an early termination policy. The BanditPolicy checks the job every 2 iterations. If the primary metric (defined later) falls outside of the top 10% range, Azure ML terminates the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.
Refer to the Azure Machine Learning documentation for more information on the BanditPolicy and the other policies available.
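To make the rule concrete, here is a simplified, pure-Python illustration of a slack-factor bandit check for a metric that is being maximized. It is not the service's implementation. It assumes the 10% range is expressed as `slack_factor=0.1`, so a run is stopped when its metric drops below `best_metric / (1 + slack_factor)`, and that the check is applied only on every second reporting interval.

```python
def bandit_should_terminate(metric, best_metric, interval_index,
                            slack_factor=0.1, evaluation_interval=2):
    """Return True if a run should be stopped under a slack-factor bandit rule."""
    # The policy is only evaluated on every `evaluation_interval`-th report.
    if interval_index % evaluation_interval != 0:
        return False
    # Runs further than the allowed slack below the best run are stopped.
    return metric < best_metric / (1.0 + slack_factor)
```

With a best accuracy of 0.80, the cutoff is roughly 0.727: a run reporting 0.72 at interval 2 would be stopped, while one reporting 0.75 would continue.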
Now we are ready to configure a run configuration object and specify the primary metric, validation_acc, that is recorded in your training runs. If you go back to the training script, you will notice that this value is logged after every epoch (a full batch set). We also tell the service that we want to maximize this value. Finally, we set the number of samples to 20 and the maximum number of concurrent jobs to 4, which is the same as the number of nodes in our compute cluster.
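Putting the sampling, policy, and metric settings together might look like the sketch below. The hyperparameter names and ranges (`--batch-size`, `--learning-rate`) and the `slack_factor` value are assumptions chosen for illustration; the metric name, goal, and run counts follow the text.

```python
def build_hyperdrive_config(run_config):
    """Build a HyperDriveConfig around an existing ScriptRunConfig."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.train.hyperdrive import (
        BanditPolicy,
        HyperDriveConfig,
        PrimaryMetricGoal,
        RandomParameterSampling,
        choice,
        loguniform,
    )

    # Parameter space (illustrative names and ranges).
    sampling = RandomParameterSampling({
        "--batch-size": choice(25, 50, 100),
        "--learning-rate": loguniform(-6, -1),
    })
    # Early termination: check every 2 intervals with 10% slack.
    policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

    return HyperDriveConfig(
        run_config=run_config,
        hyperparameter_sampling=sampling,
        policy=policy,
        primary_metric_name="validation_acc",
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=20,
        max_concurrent_runs=4,
    )
```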
HyperDriveStep
HyperDriveStep can be used to run a HyperDrive job as a step in a pipeline. Its main parameters are:
- name: Name of the step
- hyperdrive_config: A HyperDriveConfig that defines the configuration for this HyperDrive run
- inputs: List of input port bindings
- outputs: List of output port bindings
- metrics_output: Optional value specifying the location to store HyperDrive run metrics as a JSON file
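- allow_reuse: Whether the step may reuse the results of a previous run when it is re-run with the same settings
- version: Optional version tag that denotes a change in functionality for the step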
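The parameters listed above could be combined as in the following sketch. The step name, the `PipelineData` names, and the helper name are assumptions; `data_folder` is expected to be the mounted dataset input, and `hd_config` a `HyperDriveConfig`.

```python
def build_hyperdrive_step(hd_config, data_folder, datastore):
    """Create a HyperDriveStep plus the PipelineData that receives its metrics."""
    # Imported inside the function so the definition loads without the SDK.
    from azureml.pipeline.core import PipelineData, TrainingOutput
    from azureml.pipeline.steps import HyperDriveStep

    # Location where the HyperDrive run metrics are written as a JSON file.
    metrics_data = PipelineData(
        name="metrics_data",
        datastore=datastore,
        pipeline_output_name="metrics_output",
        training_output=TrainingOutput(type="Metrics"),
    )
    step = HyperDriveStep(
        name="hyperdrive_step",
        hyperdrive_config=hd_config,
        inputs=[data_folder],
        metrics_output=metrics_data,
        allow_reuse=True,
    )
    return step, metrics_data
```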
Find and register best model
When all the jobs finish, we can choose to register the model that has the highest accuracy through an additional PythonScriptStep.
Through this additional register_model_step, we register the chosen files as a model named tf-dnn-mnist under the workspace for deployment.
Run the pipeline
Monitor using widget
Wait for the completion of this Pipeline run
Retrieve the metrics
The outputs of the run above can be used as inputs to other steps in the pipeline. In this tutorial, we will show the resulting metrics.
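Once the metrics JSON file has been downloaded, it can be inspected with plain Python. The sketch below assumes the file maps each run id to a dictionary of metric names, where each metric holds either a single value or a list of logged values; that layout and both helper names are assumptions, so adjust them to what your file actually contains.

```python
import json


def load_metrics(path):
    """Read the downloaded HyperDrive metrics JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


def best_run_from_metrics(metrics, metric_name="validation_acc"):
    """Return (run_id, value) for the run with the highest logged metric."""
    best_id, best_value = None, float("-inf")
    for run_id, run_metrics in metrics.items():
        values = run_metrics.get(metric_name)
        if values is None:
            continue
        # A metric may be logged once (scalar) or many times (list).
        if not isinstance(values, list):
            values = [values]
        if not values:
            continue
        top = max(values)
        if top > best_value:
            best_id, best_value = run_id, top
    return best_id, best_value
```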
For model deployment, please refer to Training, hyperparameter tune, and deploy with TensorFlow.