
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Distributed PyTorch with DistributedDataParallel

In this tutorial, you will train a PyTorch model on the CIFAR-10 dataset using distributed training with PyTorch's DistributedDataParallel module across a GPU cluster.

Prerequisites

  • If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the Configuration notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace.
[ ]
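A typical first cell simply verifies the SDK is importable and prints its version (a minimal sketch, assuming the v1 `azureml-core` package from the Configuration notebook):

```python
# Verify the Azure ML SDK is installed and check its version.
import azureml.core

print("Azure ML SDK version:", azureml.core.VERSION)
```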

Diagnostics

Opt-in diagnostics for better experience, quality, and security of future releases.

[ ]
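A sketch of the opt-in call, assuming the `azureml.telemetry` helper shipped with SDK v1:

```python
from azureml.telemetry import set_diagnostics_collection

# Opt in to diagnostics collection for this session.
set_diagnostics_collection(send_diagnostics=True)
```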

Initialize workspace

Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

[ ]
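A minimal sketch, assuming a config.json is present in the current or a parent directory (e.g. downloaded from the portal or written by the Configuration notebook):

```python
from azureml.core import Workspace

# Load workspace details from config.json and authenticate.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
```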

Create or attach existing AmlCompute

You will need to create a compute target for training your model. In this tutorial, we use Azure ML managed compute (AmlCompute) as our remote training compute resource. Specifically, the code below creates a Standard_NC6s_v3 GPU cluster that autoscales from 0 to 4 nodes.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

Creation of AmlCompute takes approximately 5 minutes. If an AmlCompute cluster with that name already exists in your workspace, this code will skip the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.

[ ]
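A sketch of the create-or-attach pattern described above; the cluster name is an arbitrary choice:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpu-cluster"  # illustrative name

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6s_v3",  # GPU SKU; e.g. STANDARD_D2_V2 for CPU
        min_nodes=0,                 # autoscale down to 0 when idle
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)
```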

The above code creates GPU compute. If you instead want to create CPU compute, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.

Prepare dataset

Prepare the dataset used for training. We will first download and extract the publicly available CIFAR-10 dataset from the cs.toronto.edu website and then create an Azure ML FileDataset to use the data for training.

Download and extract CIFAR-10 data

[ ]
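One way to fetch and unpack the dataset with the standard library; the URL is the official CIFAR-10 python-version archive, and the local paths are our choice:

```python
import os
import tarfile
import urllib.request

cifar10_url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
data_root = "./data"
os.makedirs(data_root, exist_ok=True)

# Download the ~170 MB archive and extract it to data/cifar-10-batches-py.
archive_path = os.path.join(data_root, "cifar-10-python.tar.gz")
urllib.request.urlretrieve(cifar10_url, archive_path)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=data_root)
```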

Create Azure ML dataset

The upload_directory method will upload the data to a datastore and create a FileDataset from it. In this tutorial we will use the workspace's default datastore.

[ ]
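A sketch using `Dataset.File.upload_directory` against the default datastore; the target path `cifar10` is an arbitrary choice:

```python
from azureml.core import Dataset

datastore = ws.get_default_datastore()

# Upload the extracted files and create a FileDataset backed by them.
dataset = Dataset.File.upload_directory(
    src_dir="./data/cifar-10-batches-py",
    target=(datastore, "cifar10"),
    show_progress=True,
)
```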

Train model on the remote compute

Now that we have the AmlCompute ready to go, let's run our distributed training job.

Create a project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

[ ]
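A minimal sketch; the folder name is an arbitrary choice:

```python
import os

# Local staging folder whose contents will be uploaded to the remote compute.
project_folder = "./pytorch-distr-cifar10"
os.makedirs(project_folder, exist_ok=True)
```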

Prepare training script

Now you will need to create your training script. In this tutorial, the script for distributed training on CIFAR-10 is already provided for you at train.py. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.

Once your script is ready, copy the training script train.py into the project directory.

[ ]
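Copying the provided script into that folder might look like this (assumes train.py sits next to this notebook and the project directory from the previous step exists):

```python
import shutil

# Place the training script in the project directory that gets uploaded.
shutil.copy("train.py", "./pytorch-distr-cifar10")
```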

Create an experiment

Create an Experiment to track all the runs in your workspace for this distributed PyTorch tutorial.

[ ]
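A sketch; the experiment name is an arbitrary choice:

```python
from azureml.core import Experiment

# All runs submitted under this name are grouped together in the workspace.
experiment = Experiment(workspace=ws, name="pytorch-distr-cifar10")
```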

Create an environment

In this tutorial, we will use one of Azure ML's curated PyTorch environments for training. Curated environments are available in your workspace by default. Specifically, we will use the PyTorch 2.0 GPU curated environment.

[ ]
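Curated-environment names vary across regions and SDK releases, so the name below is an assumption; check `Environment.list(ws)` for the exact PyTorch 2.0 GPU environment available in your workspace:

```python
from azureml.core import Environment

# Assumed curated-environment name; verify it exists in your workspace.
pytorch_env = Environment.get(workspace=ws, name="AzureML-ACPT-pytorch-2.0-cuda11.7")
```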

Configure the training job

To launch a distributed PyTorch job on Azure ML, you have two options:

  1. Per-process launch - specify the total # of worker processes (typically one per GPU) you want to run, and Azure ML will handle launching each process.
  2. Per-node launch with torch.distributed.launch - provide the torch.distributed.launch command you want to run on each node.

For more information, see the documentation.

Both options are shown below.

Per-process launch

To use the per-process launch option, in which Azure ML handles launching each of the processes to run your training script:

  1. Specify the training script and arguments
  2. Create a PyTorchConfiguration and specify node_count and process_count. The process_count is the total number of processes you want to run for the job; this should typically equal the # of GPUs available on each node multiplied by the # of nodes. Since this tutorial uses the Standard_NC6s_v3 SKU, which has one GPU, the total process count for a 2-node job is 2. If you are using a SKU with >1 GPUs, adjust the process_count accordingly.

Azure ML will set the MASTER_ADDR, MASTER_PORT, NODE_RANK, and WORLD_SIZE environment variables on each node, in addition to the process-level RANK and LOCAL_RANK environment variables, all of which are needed for distributed PyTorch training.

[ ]
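A sketch of the per-process configuration for a 2-node job on Standard_NC6s_v3 (1 GPU per node); it reuses the compute target, environment, dataset, and project folder from earlier cells, and the script flags are assumptions about what train.py accepts:

```python
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

# 2 nodes x 1 GPU each => 2 total worker processes.
distr_config = PyTorchConfiguration(process_count=2, node_count=2)

src = ScriptRunConfig(
    source_directory="./pytorch-distr-cifar10",
    script="train.py",
    arguments=["--data-dir", dataset.as_download(), "--epochs", 25],  # assumed flags
    compute_target=compute_target,
    environment=pytorch_env,
    distributed_job_config=distr_config,
)
```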

Per-node launch with torch.distributed.launch

If you would instead like to use the PyTorch-provided launch utility torch.distributed.launch to handle launching the worker processes on each node, you can do so as well.

  1. Provide the launch command to the command parameter of ScriptRunConfig. For PyTorch jobs, Azure ML will set the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables on each node, so you can simply reference those environment variables in your command. If you are using a SKU with >1 GPUs, adjust the --nproc_per_node argument accordingly.

  2. Create a PyTorchConfiguration and specify the node_count. You do not need to specify the process_count; by default Azure ML will launch one process per node to run the command you provided.

Uncomment the code below to configure a job with this method.

[ ]
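A commented-out sketch of the per-node alternative, again reusing objects from earlier cells; the flags on train.py are assumptions:

```python
# from azureml.core import ScriptRunConfig
# from azureml.core.runconfig import PyTorchConfiguration
#
# # One launcher per node; torch.distributed.launch spawns the worker processes.
# distr_config = PyTorchConfiguration(node_count=2)
#
# launch_cmd = [
#     "python", "-m", "torch.distributed.launch",
#     "--nproc_per_node", "1",         # = number of GPUs per node
#     "--nnodes", "2",
#     "--node_rank", "$NODE_RANK",     # set by Azure ML on each node
#     "--master_addr", "$MASTER_ADDR",
#     "--master_port", "$MASTER_PORT",
#     "--use_env",
#     "train.py", "--data-dir", dataset.as_download(), "--epochs", "25",
# ]
#
# src = ScriptRunConfig(
#     source_directory="./pytorch-distr-cifar10",
#     command=launch_cmd,
#     compute_target=compute_target,
#     environment=pytorch_env,
#     distributed_job_config=distr_config,
# )
```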

Submit job

Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.

[ ]
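Submission itself is a one-liner against the experiment created earlier:

```python
# Submit the run; this returns immediately with a Run object.
run = experiment.submit(src)
print(run)
```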

Monitor your run

You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run.

[ ]
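A sketch, assuming the `azureml-widgets` package is installed in the notebook environment:

```python
from azureml.widgets import RunDetails

# Render a live-updating view of the run inside the notebook.
RunDetails(run).show()
```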

Alternatively, you can block until the script has completed training before running more code.

[ ]
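The blocking alternative looks like this:

```python
# Block until the remote run reaches a terminal state, streaming its logs.
run.wait_for_completion(show_output=True)
```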