Train TensorFlow Resume Training

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Resuming TensorFlow training from a previous run

In this tutorial, you will resume training an MNIST model in TensorFlow from a previously submitted run.

Prerequisites

  • Understand the architecture and terms introduced by Azure Machine Learning (AML)
  • Go through the configuration notebook to:
    • install the AML SDK
    • create a workspace and its configuration file (config.json)
  • Review the tutorial on single-node TensorFlow training using the SDK
[ ]

Diagnostics

Opt in to diagnostics for a better experience, and to improve the quality and security of future releases.

[ ]
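A minimal sketch of this cell, assuming the azureml-sdk package is installed:

```python
from azureml.telemetry import set_diagnostics_collection

# Opt in to diagnostics collection for this session.
set_diagnostics_collection(send_diagnostics=True)
```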

Initialize workspace

Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

[ ]
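This cell might look like the following, assuming config.json is in the current or a parent directory:

```python
from azureml.core import Workspace

# Reads subscription, resource group, and workspace name from config.json.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
```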

Create or Attach existing AmlCompute

You will need to create a compute target for training your model. In this tutorial, you create AmlCompute as your training compute resource.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

Creation of AmlCompute takes approximately 5 minutes. If an AmlCompute with that name already exists in your workspace, this code will skip the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.

[ ]
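A sketch of this cell; the cluster name, VM size, and node count below are example values:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpu-cluster"  # example name; any unique name in the workspace works

try:
    # Reuse the cluster if it already exists.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6",  # GPU VM size
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)
```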

The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.

Create a Dataset for Files

A Dataset can reference single or multiple files in your datastores or public urls. The files can be of any format. Dataset provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. Learn More

[ ]
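One way to write this cell is to create a FileDataset from web URLs. The URLs below are placeholders; substitute the actual location of the MNIST files:

```python
from azureml.core import Dataset

# Illustrative placeholder URLs; point these at the real MNIST file locations.
web_paths = [
    "https://<your-storage>/mnist/train-images-idx3-ubyte.gz",
    "https://<your-storage>/mnist/train-labels-idx1-ubyte.gz",
]
dataset = Dataset.File.from_files(path=web_paths)
```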

You may want to register datasets to your workspace using the register() method so they can be shared with others, reused across experiments, and referred to by name in your training script. You can try to get the dataset first to see whether it is already registered.

[ ]
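A sketch of this cell, assuming `dataset` was created in the previous cell; the dataset name is an example:

```python
from azureml.core import Dataset

dataset_name = "mnist-dataset"  # example name

try:
    # Reuse the dataset if it is already registered in the workspace.
    dataset = Dataset.get_by_name(ws, name=dataset_name)
except Exception:
    # Otherwise register the dataset created above.
    dataset = dataset.register(
        workspace=ws,
        name=dataset_name,
        description="MNIST training files",
        create_new_version=True,
    )
```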
[ ]

Train model on the remote compute

Create a project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on.

[ ]
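This cell can be as simple as creating a local folder; the folder name is an example:

```python
import os

# Local folder whose contents are uploaded to the remote compute with the run.
project_folder = "./tf-resume-training"
os.makedirs(project_folder, exist_ok=True)
```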

Copy the training script tf_mnist_with_checkpoint.py into this project directory.

[ ]
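A sketch of the copy step; the guard lets the cell re-run cleanly even if the script is not alongside the notebook:

```python
import shutil
from pathlib import Path

project_folder = Path("./tf-resume-training")  # same folder as the previous cell
project_folder.mkdir(parents=True, exist_ok=True)

# Copy the training script into the project directory so it is
# uploaded with the run.
script = Path("tf_mnist_with_checkpoint.py")
if script.exists():
    shutil.copy(script, project_folder / script.name)
```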

Create an experiment

Create an Experiment to track all the runs in your workspace for this TensorFlow tutorial.

[ ]
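This cell might look like the following; the experiment name is an example:

```python
from azureml.core import Experiment

# Runs submitted under this experiment are grouped together in the studio UI.
experiment_name = "tf-resume-training"
experiment = Experiment(workspace=ws, name=experiment_name)
```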

Create an environment

In this tutorial, we will use one of Azure ML's curated TensorFlow environments for training. Curated environments are available in your workspace by default. Specifically, we will use the TensorFlow 1.13 GPU curated environment.

[ ]
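A sketch of this cell; the curated environment name is an assumption, so verify it against `Environment.list(workspace=ws)` if it is not found:

```python
from azureml.core import Environment

# Fetch the TensorFlow 1.13 GPU curated environment (name assumed).
tf_env = Environment.get(workspace=ws, name="AzureML-TensorFlow-1.13-GPU")
```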

Configure the training job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.

[ ]
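A sketch of this cell, assuming `project_folder`, `compute_target`, and `tf_env` from the earlier cells:

```python
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory=project_folder,
    script="tf_mnist_with_checkpoint.py",
    compute_target=compute_target,
    environment=tf_env,
)
```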

Submit job

Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.

[ ]
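Submitting returns immediately with a Run object you can use to track the job:

```python
# Asynchronous: returns as soon as the run is queued.
run = experiment.submit(src)
print(run)
```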

Monitor your run

You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

[ ]
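This cell uses the azureml-widgets package:

```python
from azureml.widgets import RunDetails

# Renders a live-updating view of the run inside the notebook.
RunDetails(run).show()
```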

Alternatively, you can block until the script has completed training before running more code.

[ ]
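The blocking equivalent:

```python
# Blocks until the run finishes, streaming the job logs to the notebook.
run.wait_for_completion(show_output=True)
```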

Now let's resume training from the above run

First, we will get the DataPath to the outputs directory of the above run which contains the checkpoint files. We will create a DataReference from this DataPath and specify the compute binding as mount mode; this will tell Azure ML to mount the checkpoint files on the compute target for the run.

[ ]
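One way to sketch this cell. The path on the datastore is an assumption (run outputs typically land under `azureml/<run-id>/outputs` on the default datastore); verify it in your workspace before relying on it:

```python
from azureml.data.data_reference import DataReference

datastore = ws.get_default_datastore()

# Mount the previous run's outputs folder (path assumed, see lead-in).
checkpoint_ref = DataReference(
    datastore=datastore,
    data_reference_name="mnist_checkpoints",
    path_on_datastore=f"azureml/{run.id}/outputs",
    mode="mount",
)
```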

Now, we will create a new ScriptRunConfig and append the additional '--resume-from' argument with the corresponding checkpoint location to the arguments parameter.

[ ]
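A sketch of the resumed configuration; attaching the data reference via `run_config.data_references` is one way to make the mounted path available to the run:

```python
from azureml.core import ScriptRunConfig

src_resume = ScriptRunConfig(
    source_directory=project_folder,
    script="tf_mnist_with_checkpoint.py",
    # str(checkpoint_ref) expands to the mounted checkpoint path at runtime.
    arguments=["--resume-from", str(checkpoint_ref)],
    compute_target=compute_target,
    environment=tf_env,
)
# Make the checkpoint data reference available to the run.
src_resume.run_config.data_references = {
    checkpoint_ref.data_reference_name: checkpoint_ref.to_config()
}
```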

Now you can submit the experiment, and it should resume from the previous run's checkpoint files.

[ ]
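Submitting the resumed run works the same way as before:

```python
# Queue the resumed run and block until it completes.
resume_run = experiment.submit(src_resume)
resume_run.wait_for_completion(show_output=True)
```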
[ ]