Hyperparameter Tune And Warm Start With Tensorflow

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

Warm start hyperparameter tuning

In this tutorial, you will learn how to warm start a hyperparameter tuning run from a previous tuning run.

Let's get started. First let's import some Python libraries.

[ ]
[ ]

Diagnostics

Opt-in diagnostics for better experience, quality, and security of future releases.

[ ]

Initialize workspace

Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

[ ]

Create an Azure ML experiment

Let's create an experiment named "tf-mnist" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

[ ]

Download MNIST dataset

To train on the MNIST dataset, we first need to download it directly from Yann LeCun's website and save the files in a local data folder.

[ ]

Show some sample images

Let's load the downloaded compressed file into numpy arrays using some utility functions included in the utils.py library file from the current folder. Then we use matplotlib to plot 30 random images from the dataset along with their labels.

[ ]

Create a FileDataset

A FileDataset references single or multiple files in your datastores or public URLs. The files can be of any format. FileDataset provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

[ ]

Use the register() method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script. You can try to get the dataset first to see if it's already registered.

[ ]

Create or Attach existing AmlCompute

You will need to create a compute target for training your model. In this tutorial, you create AmlCompute as your training compute resource.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

If no cluster with the given name is found, we create a new one here. We will create an AmlCompute cluster of Standard_NC6s_v3 GPU VMs. This process is broken down into 3 steps:

  1. Create the configuration (this step is local and only takes a second).
  2. Create the cluster (this step takes about 20 seconds).
  3. Provision the VMs to bring the cluster to its initial size (1 in this case). This step takes about 3-5 minutes and produces only sparse output along the way. Please wait until the call returns before moving to the next cell.
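
The three steps above can be sketched with the Azure ML SDK v1 as follows. This is a minimal sketch, not the notebook's own cell: the helper name get_or_create_cluster is hypothetical, max_nodes=4 is an assumption chosen to match the 4 concurrent jobs used later, and ws is assumed to be the Workspace object created earlier. The SDK imports sit inside the function so the module can be loaded without azureml-core installed.

```python
CLUSTER_NAME = "gpu-cluster"   # cluster name used in this tutorial
VM_SIZE = "Standard_NC6s_v3"   # GPU VM size named in the text
MAX_NODES = 4                  # assumption: matches the 4 concurrent jobs used later


def get_or_create_cluster(ws, name=CLUSTER_NAME):
    """Return the named AmlCompute target, creating it if it does not exist."""
    # Imported lazily; requires the azureml-core package.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Reuse the cluster if the workspace already has one with this name.
        target = ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        # Step 1: create the configuration (local, fast).
        config = AmlCompute.provisioning_configuration(
            vm_size=VM_SIZE, max_nodes=MAX_NODES
        )
        # Step 2: create the cluster.
        target = ComputeTarget.create(ws, name, config)
    # Step 3: block until provisioning finishes.
    target.wait_for_completion(show_output=True)
    return target
```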
[ ]

Now that you have created the compute target, let's see what the workspace's compute_targets property returns. You should now see one entry named 'gpu-cluster' of type AmlCompute.

[ ]

Copy the training files into the script folder

The TensorFlow training script is already created for you. You can simply copy it into the script folder, together with the utility library used to load the compressed data files into numpy arrays.

[ ]

Construct neural network in TensorFlow

The training script tf_mnist.py creates a very simple DNN (deep neural network) with just 2 hidden layers. The input layer has 28 * 28 = 784 neurons, each representing a pixel in an image. The first hidden layer has 300 neurons, and the second hidden layer has 100 neurons. The output layer has 10 neurons, each representing a targeted label from 0 to 9.
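
To get a feel for the size of this network, the short sketch below counts its trainable parameters. It assumes every layer is fully connected and has one bias per neuron, which the text does not state explicitly.

```python
# Layer widths from the description: input, two hidden layers, output.
LAYER_SIZES = [28 * 28, 300, 100, 10]


def count_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print(count_parameters(LAYER_SIZES))  # 266610
```

Almost 90% of those parameters (235,500) sit between the input layer and the first hidden layer.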

DNN

Azure ML concepts

Please note the following three things in the code below:

  1. The script accepts arguments using the argparse package. In this case there is one argument, --data_folder, which specifies the file system folder in which the script can find the MNIST data.
         parser = argparse.ArgumentParser()
         parser.add_argument('--data_folder')

  2. The script is accessing the Azure ML Run object by executing run = Run.get_context(). Further down, the script is using the run to report the training accuracy and the validation accuracy as training progresses.
         run.log('training_acc', np.float(acc_train))
         run.log('validation_acc', np.float(acc_val))

  3. When running the script on Azure ML, you can write files out to a folder ./outputs that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record.
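
The three points can be combined into a small skeleton. This is a sketch under assumptions, not the contents of tf_mnist.py: the helper names (parse_args, get_run, report, save_artifact) and the artifact file name are hypothetical, and get_run returns None when azureml-core is not installed so the skeleton can also be tried locally.

```python
import argparse
import os


def parse_args(argv=None):
    """Point 1: read the data folder from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_folder", type=str, help="folder holding the MNIST files")
    return parser.parse_args(argv)


def get_run():
    """Point 2: get the Azure ML Run, or None if azureml-core is unavailable."""
    try:
        from azureml.core import Run
    except ImportError:
        return None
    return Run.get_context()


def report(run, acc_train, acc_val):
    """Point 2: log both accuracies to the run history."""
    if run is not None:
        run.log("training_acc", float(acc_train))
        run.log("validation_acc", float(acc_val))


def save_artifact(text, name="notes.txt", out_dir="./outputs"):
    """Point 3: files written under ./outputs are collected as run artifacts."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        f.write(text)
    return path
```

The sketch uses the built-in float instead of np.float, because the np.float alias was removed in NumPy 1.24.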

The next cell will print out the training code for you to inspect it.

[ ]

Create an environment

In this tutorial, we will use one of Azure ML's curated TensorFlow environments for training. Curated environments are available in your workspace by default. Specifically, we will use the TensorFlow 2.0 GPU curated environment.

[ ]

Configure the training job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.
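
A minimal sketch of such a configuration with the Azure ML SDK v1 is shown below. The helper names are hypothetical, the hyperparameter defaults are placeholder values rather than the notebook's, and the flag spellings follow the ones used later in this tutorial. The SDK import is done inside the function so the module loads without azureml-core installed.

```python
def build_arguments(data_input, batch_size=64, first_layer=256,
                    second_layer=128, learning_rate=0.01):
    """Command-line arguments for the training script (placeholder defaults)."""
    return [
        "--data-folder", data_input,
        "--batch-size", batch_size,
        "--first-layer-neurons", first_layer,
        "--second-layer-neurons", second_layer,
        "--learning-rate", learning_rate,
    ]


def build_script_run_config(script_folder, compute_target, environment, arguments):
    """Bundle script, arguments, compute target, and environment into one job config."""
    # Imported lazily; requires the azureml-core package.
    from azureml.core import ScriptRunConfig

    return ScriptRunConfig(
        source_directory=script_folder,
        script="tf_mnist.py",
        arguments=arguments,
        compute_target=compute_target,
        environment=environment,
    )
```

In a real job, data_input would typically be a mounted or downloaded dataset reference rather than a plain string.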

[ ]

Submit job to run

Submit the ScriptRunConfig to an Azure ML experiment to kick off the execution.

[ ]
[ ]

Intelligent hyperparameter tuning

Now that we have trained the model with one set of hyperparameters, we can tune the model hyperparameters to optimize model performance. First let's define the parameter space using random sampling. Typically, the hyperparameter exploration process is painstakingly manual, given that the search space is vast and evaluation of each configuration can be expensive.

Azure Machine Learning allows you to automate hyperparameter exploration in an efficient manner, saving you significant time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then automatically launches multiple simultaneous runs with different parameter configurations and finds the configuration that results in the best performance, measured by the metric you choose. Poorly performing training runs are automatically terminated early, which reduces wasted compute. These resources are instead used to explore other hyperparameter configurations.

We start by defining the hyperparameter space. In this case, we will tune 4 hyperparameters - '--batch-size', '--first-layer-neurons', '--second-layer-neurons' and '--learning-rate'. For each of these hyperparameters, we specify the range of values they can take. In this example, we will use Random Sampling to randomly select hyperparameter values from the defined search space.
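
In the SDK, the search space is expressed with RandomParameterSampling and parameter expressions such as choice and loguniform from azureml.train.hyperdrive. The sketch below does not call the service. It mimics random sampling locally so you can see what one sampled configuration looks like. The candidate values and the learning-rate range are illustrative assumptions, and loguniform(low, high) is modeled as exp(uniform(low, high)).

```python
import math
import random

# Hypothetical search space: (kind, specification) per hyperparameter.
SPACE = {
    "--batch-size": ("choice", [25, 50, 100]),
    "--first-layer-neurons": ("choice", [10, 50, 200, 300, 500]),
    "--second-layer-neurons": ("choice", [10, 50, 200, 500]),
    "--learning-rate": ("loguniform", (-6, -1)),
}


def sample(space, rng):
    """Draw one configuration independently for every hyperparameter."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":
            config[name] = rng.choice(spec)
        elif kind == "loguniform":
            low, high = spec
            config[name] = math.exp(rng.uniform(low, high))
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


rng = random.Random(0)  # seeded so repeated runs draw the same configurations
for _ in range(3):
    print(sample(SPACE, rng))
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude instead of clustering them near the upper bound.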

[ ]

Next, we will create a new ScriptRunConfig without the above parameters since they will be passed in later. Note that we still need to keep the data-folder parameter since it is not a hyperparameter we will sweep.

[ ]

Next we will define an early termination policy. This terminates poorly performing runs automatically, so that their resources can be used to explore other parameter configurations instead. In this example, we will use the TruncationSelectionPolicy, truncating the lowest-performing 25% of runs. The policy checks each job every 2 iterations. If the primary metric (defined later) falls in the bottom 25%, Azure ML terminates the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.
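
In the SDK this policy is created with TruncationSelectionPolicy(evaluation_interval=2, truncation_percentage=25). The sketch below is a simplified local model of the idea, not the service's implementation: the function names are hypothetical, and the rounding and tie-breaking rules are assumptions.

```python
import math


def should_evaluate(iteration, evaluation_interval=2):
    """The policy is only applied every `evaluation_interval` iterations."""
    return iteration % evaluation_interval == 0


def runs_to_truncate(metrics, truncation_percentage=25, maximize=True):
    """Return the ids of the worst-performing runs.

    metrics maps a run id to its latest primary metric value.
    """
    n_cut = math.floor(len(metrics) * truncation_percentage / 100)
    # Sort worst first: ascending when maximizing, descending when minimizing.
    ranked = sorted(metrics, key=metrics.get, reverse=not maximize)
    return set(ranked[:n_cut])


latest = {"run_a": 0.90, "run_b": 0.50, "run_c": 0.80, "run_d": 0.70}
if should_evaluate(iteration=2):
    print(runs_to_truncate(latest))  # {'run_b'}
```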

[ ]

Now we are ready to configure a run configuration object and specify the primary metric, validation_acc, that is recorded in your training runs. If you go back to the training script, you will notice that this value is logged after every epoch (one full pass over all batches). We also tell the service that we want to maximize this value. Finally, we set the number of samples to 15 and the maximum number of concurrent jobs to 4, which is the same as the number of nodes in our compute cluster.
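
Put together, the configuration could look like the sketch below. The helper name build_hyperdrive_config is hypothetical, and its three arguments are assumed to be the ScriptRunConfig, sampling object, and early termination policy created in the previous cells. The SDK import is done inside the function so the module loads without the azureml packages installed.

```python
PRIMARY_METRIC = "validation_acc"  # must match the name passed to run.log
MAX_TOTAL_RUNS = 15                # number of sampled configurations
MAX_CONCURRENT_RUNS = 4            # one job per cluster node


def build_hyperdrive_config(run_config, sampling, policy):
    """Assemble the tuning configuration described in the text."""
    # Imported lazily; requires the azureml-train-core package.
    from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

    return HyperDriveConfig(
        run_config=run_config,
        hyperparameter_sampling=sampling,
        policy=policy,
        primary_metric_name=PRIMARY_METRIC,
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=MAX_TOTAL_RUNS,
        max_concurrent_runs=MAX_CONCURRENT_RUNS,
    )
```

The returned object is what gets passed to the experiment's submit method to launch the tuning job.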

[ ]

Finally, let's launch the hyperparameter tuning job.

[ ]

We can use a run history widget to show the progress. Be patient as this might take a while to complete.

[ ]
[ ]
[ ]

Find and register best model

When all the jobs finish, we can find the run with the highest accuracy.

[ ]

Now let's list the model files uploaded during the run.

[ ]

We can then register the folder (and all files in it) as a model named tf-dnn-mnist under the workspace for deployment.
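
A minimal sketch of these steps is shown below. It only relies on methods of the HyperDriveRun and Run objects (get_best_run_by_primary_metric, get_file_names, register_model). The helper name register_best_model is hypothetical, and the default model_path of outputs/model is an assumption about where the training script saves the model.

```python
def register_best_model(hyperdrive_run, model_name, model_path="outputs/model"):
    """Find the best child run, list its files, and register its model folder."""
    best_run = hyperdrive_run.get_best_run_by_primary_metric()
    if best_run is None:
        raise RuntimeError("no child run reported the primary metric")
    # Show what the best run uploaded, so the model path can be checked.
    for name in best_run.get_file_names():
        print(name)
    return best_run.register_model(model_name=model_name, model_path=model_path)
```

For the tuning run above, this would be called as register_best_model(hyperdrive_run, "tf-dnn-mnist"), where hyperdrive_run is the object returned when the tuning job was submitted.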

[ ]

Warm start a Hyperparameter Tuning experiment

Finding the best hyperparameter values for your model is often an iterative process that needs multiple tuning runs, each learning from the previous ones. Reusing knowledge from these previous runs accelerates the hyperparameter tuning process, which reduces the cost of tuning the model and can improve the primary metric of the resulting model. When warm starting a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run are used as prior knowledge to intelligently pick new samples, so as to improve the primary metric. Additionally, when using Random or Grid sampling, any early termination decisions leverage metrics from the previous runs to identify poorly performing training runs.

Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed hyperparameter tuning parent runs. In this example, we warm start from the initial hyperparameter tuning run in this notebook.
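
In the SDK, the parent runs are passed to HyperDriveConfig through its resume_from parameter. The sketch below shows one way to wire this up. The helper names are hypothetical, the run budget values are placeholders, and the local check_parents guard simply mirrors the limit of 5 parents stated above rather than any validation performed by the service.

```python
MAX_WARM_START_PARENTS = 5  # limit stated in the text


def check_parents(parents):
    """Validate the list of parent tuning runs before building the config."""
    parents = list(parents)
    if not parents:
        raise ValueError("at least one parent run is needed to warm start")
    if len(parents) > MAX_WARM_START_PARENTS:
        raise ValueError(f"at most {MAX_WARM_START_PARENTS} parent runs are allowed")
    return parents


def build_warm_start_config(run_config, sampling, policy, parents):
    """Tuning config that reuses knowledge from earlier tuning runs."""
    # Imported lazily; requires the azureml-train-core package.
    from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

    return HyperDriveConfig(
        run_config=run_config,
        hyperparameter_sampling=sampling,
        policy=policy,
        resume_from=check_parents(parents),  # previous HyperDriveRun objects
        primary_metric_name="validation_acc",
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=5,        # placeholder budget for the new run
        max_concurrent_runs=4,   # placeholder
    )
```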

[ ]

We can use the run history widget to show the progress of this warm start run. Be patient as this might take a while to complete.

[ ]
[ ]

Find and register best model from the warm start run

When all the jobs finish, we can find the run with the highest accuracy and register the folder (and all files in it) as a model named tf-dnn-mnist-warm-start under the workspace for deployment.

[ ]

Resuming individual training runs in a hyperparameter tuning experiment

In the previous section, we saw how you can warm start a hyperparameter tuning run, to learn from a previously completed run. Additionally, there might be occasions when individual training runs of a hyperparameter tuning experiment are cancelled due to budget constraints or fail due to other reasons. It is now possible to resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run will use the same hyperparameter configuration and mount the storage used for that run. The training script should accept the "--resume-from" argument, which contains the checkpoint or model files from which to resume the training run.
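
On the training-script side, handling this argument could look like the sketch below. It is an illustration under assumptions, not the notebook's script: the helper latest_checkpoint and the zero-padded epoch-NNNN.ckpt file naming are hypothetical, and loading the checkpoint into the model is left to the training framework.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # argparse exposes "--resume-from" as args.resume_from.
    parser.add_argument("--resume-from", type=str, default=None,
                        help="folder with checkpoint or model files to resume from")
    return parser.parse_args(argv)


def latest_checkpoint(folder, suffix=".ckpt"):
    """Return the path of the last checkpoint in `folder`, or None if there is none.

    Assumes zero-padded names such as epoch-0003.ckpt, so the
    lexicographically largest name is the most recent checkpoint.
    """
    if folder is None or not os.path.isdir(folder):
        return None
    names = [n for n in os.listdir(folder) if n.endswith(suffix)]
    if not names:
        return None
    return os.path.join(folder, max(names))


args = parse_args([])  # no --resume-from given: start training from scratch
print(latest_checkpoint(args.resume_from))  # None
```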

You can also resume individual runs as part of an experiment that spends additional budget on hyperparameter tuning. Any budget that remains after resuming the specified training runs is used to explore additional configurations.

In this example, we will resume one of the child runs cancelled in the previous hyperparameter tuning run in this notebook.

[ ]

Next, we will configure the hyperparameter tuning experiment to warm start from the previous experiment and to resume individual training runs. We then submit this warm start hyperparameter tuning run.
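
A sketch of both steps is shown below. HyperDriveConfig accepts the parent runs through resume_from and the individual runs through resume_child_runs. The helper names are hypothetical, and the "Canceled" status string is the spelling assumed to be returned by Run.get_status.

```python
def cancelled_children(hyperdrive_run):
    """Child runs of a tuning run that ended in the cancelled state."""
    return [
        child for child in hyperdrive_run.get_children()
        if child.get_status() == "Canceled"
    ]


def resume_kwargs(parent_run, child_runs):
    """Extra keyword arguments for HyperDriveConfig to warm start and resume."""
    return {
        "resume_from": [parent_run],            # warm start from the earlier tuning run
        "resume_child_runs": list(child_runs),  # individual runs to pick up again
    }
```

The returned dictionary can be unpacked into the HyperDriveConfig call alongside the usual run_config, sampling, policy, and metric settings, for example HyperDriveConfig(..., **resume_kwargs(parent, children[:1])).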

[ ]

We can use the run history widget to show the progress of this resumed run. Be patient as this might take a while to complete.

[ ]
[ ]

Find and register best model from the resumed run

When all the jobs finish, we can find the run with the highest accuracy and register the folder (and all files in it) as a model named tf-dnn-mnist-resumed under the workspace for deployment.

[ ]