Distributed PyTorch with DistributedDataParallel
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
In this tutorial, you will train a PyTorch model on the CIFAR-10 dataset using distributed training with PyTorch's DistributedDataParallel module across a GPU cluster.
Prerequisites
- If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the Configuration notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace.
Diagnostics
Opt in to diagnostics to help improve the experience, quality, and security of future releases.
Initialize workspace
Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.
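Assuming the Azure ML SDK (azureml-core) is installed and a config.json downloaded from the Azure portal sits in this directory (or a parent), the initialization is a one-liner:

```python
from azureml.core import Workspace

# Reads config.json from the current directory or a parent directory.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep="\n")
```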
Create or attach existing AmlCompute
You will need to create a compute target for training your model. In this tutorial, we use Azure ML managed compute (AmlCompute) for our remote training compute resource. Specifically, the code below creates a Standard_NC6s_v3 GPU cluster that autoscales from 0 to 4 nodes.
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
Creation of AmlCompute takes approximately 5 minutes. If the AmlCompute with that name is already in your workspace, this code will skip the creation process.
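A sketch of the creation step (the cluster name is an arbitrary choice for this tutorial, and ws is the workspace object from the earlier step):

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpu-cluster"  # arbitrary name for this sketch

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6s_v3",  # one V100 GPU per node
        min_nodes=0,                 # scale down to zero when idle
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)
```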
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.
The above code creates GPU compute. If you instead want to create CPU compute, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.
Prepare dataset
Prepare the dataset used for training. We will first download and extract the publicly available CIFAR-10 dataset from the cs.toronto.edu website and then create an Azure ML FileDataset to use the data for training.
Download and extract CIFAR-10 data
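One way to script the download using only the standard library (the local folder name cifar-10 is this sketch's choice):

```python
import os
import tarfile
import urllib.request

data_root = "cifar-10"
os.makedirs(data_root, exist_ok=True)

# ~170 MB download from the CIFAR-10 homepage.
url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
archive_path = os.path.join(data_root, "cifar-10-python.tar.gz")
urllib.request.urlretrieve(url, archive_path)

# Extracts into cifar-10/cifar-10-batches-py.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=data_root)
```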
Create Azure ML dataset
The upload_directory method will upload the data to a datastore and create a FileDataset from it. In this tutorial we will use the workspace's default datastore.
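A sketch of the upload, assuming ws is the workspace object from earlier and the local folder from the download step; the target path on the datastore is an arbitrary choice:

```python
from azureml.core import Dataset

data_root = "cifar-10"  # local folder created in the download step
datastore = ws.get_default_datastore()

dataset = Dataset.File.upload_directory(
    src_dir=data_root,
    target=(datastore, "datasets/cifar10"),  # path on the datastore
    overwrite=True,
)
```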
Train model on the remote compute
Now that we have the AmlCompute ready to go, let's run our distributed training job.
Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.
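For example (the folder name is an arbitrary choice for this sketch):

```python
import os

project_folder = "./pytorch-distr-cifar10"  # arbitrary name for this sketch
os.makedirs(project_folder, exist_ok=True)
```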
Prepare training script
Now you will need to create your training script. In this tutorial, the script for distributed training on CIFAR-10 is already provided for you at train.py. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.
Once your script is ready, copy the training script train.py into the project directory.
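A minimal sketch, assuming train.py sits next to this notebook as in the sample repository:

```python
import shutil

project_folder = "./pytorch-distr-cifar10"  # project directory created earlier
shutil.copy("train.py", project_folder)
```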
Create an experiment
Create an Experiment to track all the runs in your workspace for this distributed PyTorch tutorial.
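For example (the experiment name is an arbitrary choice, and ws is the workspace object from earlier):

```python
from azureml.core import Experiment

experiment = Experiment(ws, name="pytorch-distr-cifar10")
```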
Create an environment
In this tutorial, we will use one of Azure ML's curated PyTorch environments for training. Curated environments are available in your workspace by default. Specifically, we will use the PyTorch 2.0 GPU curated environment.
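Retrieving the curated environment might look like the sketch below. The exact environment name varies by SDK version and region, so the name used here is a placeholder assumption; inspect Environment.list(workspace=ws) to find the PyTorch 2.0 GPU entry your workspace offers.

```python
from azureml.core import Environment

# Placeholder name -- an assumption; check Environment.list(workspace=ws)
# for the PyTorch 2.0 GPU curated environment available to you.
curated_env_name = "AzureML-pytorch-2.0-gpu"
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)
```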
Configure the training job
To launch a distributed PyTorch job on Azure ML, you have two options:
- Per-process launch - specify the total number of worker processes (typically one per GPU) you want to run, and Azure ML will handle launching each process.
- Per-node launch with torch.distributed.launch - provide the torch.distributed.launch command you want to run on each node.
For more information, see the documentation.
Both options are shown below.
Per-process launch
To use the per-process launch option, in which Azure ML handles launching each of the processes that run your training script:
- Specify the training script and arguments
- Create a PyTorchConfiguration and specify node_count and process_count. The process_count is the total number of processes you want to run for the job; this should typically equal the number of GPUs available on each node multiplied by the number of nodes. Since this tutorial uses the Standard_NC6s_v3 SKU, which has one GPU, the total process count for a 2-node job is 2. If you are using a SKU with more than one GPU, adjust the process_count accordingly.
Azure ML will set the MASTER_ADDR, MASTER_PORT, NODE_RANK, and WORLD_SIZE environment variables on each node, in addition to the process-level RANK and LOCAL_RANK environment variables, all of which are needed for distributed PyTorch training.
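Putting the two steps together might look like the sketch below, assuming the project_folder, dataset, compute_target, and pytorch_env objects from the earlier steps; the script argument names are assumptions and should match your own train.py:

```python
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

# 2 nodes x 1 GPU per Standard_NC6s_v3 node = 2 processes total.
distr_config = PyTorchConfiguration(process_count=2, node_count=2)

src = ScriptRunConfig(
    source_directory=project_folder,   # project directory from earlier
    script="train.py",
    # Argument names are assumptions; match them to your train.py.
    arguments=["--data-dir", dataset.as_download(), "--epochs", 25],
    compute_target=compute_target,     # AmlCompute cluster from earlier
    environment=pytorch_env,           # curated PyTorch environment
    distributed_job_config=distr_config,
)
```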
Per-node launch with torch.distributed.launch
If you would instead like to use the PyTorch-provided launch utility torch.distributed.launch to handle launching the worker processes on each node, you can do so as well.
- Provide the launch command to the command parameter of ScriptRunConfig. For PyTorch jobs, Azure ML will set the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables on each node, so you can simply reference those environment variables in your command. If you are using a SKU with more than one GPU, adjust the --nproc_per_node argument accordingly.
- Create a PyTorchConfiguration and specify the node_count. You do not need to specify process_count; by default Azure ML will launch one process per node to run the command you provided.
Uncomment the code below to configure a job with this method.
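A commented-out sketch of such a configuration, reusing the project_folder, dataset, compute_target, and pytorch_env objects from the earlier steps (script argument names remain assumptions):

```python
# from azureml.core import ScriptRunConfig
# from azureml.core.runconfig import PyTorchConfiguration
#
# launch_cmd = [
#     "python", "-m", "torch.distributed.launch",
#     "--nproc_per_node", "1",          # raise for SKUs with more GPUs
#     "--nnodes", "2",
#     "--node_rank", "$NODE_RANK",      # set by Azure ML on each node
#     "--master_addr", "$MASTER_ADDR",  # set by Azure ML
#     "--master_port", "$MASTER_PORT",  # set by Azure ML
#     "--use_env",
#     "train.py",
#     "--data-dir", dataset.as_download(),
#     "--epochs", "25",
# ]
#
# # One launcher process per node; process_count defaults to node_count.
# distr_config = PyTorchConfiguration(node_count=2)
#
# src = ScriptRunConfig(
#     source_directory=project_folder,
#     command=launch_cmd,
#     compute_target=compute_target,
#     environment=pytorch_env,
#     distributed_job_config=distr_config,
# )
```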
Submit job
Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.
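Assuming the experiment and src objects from the earlier steps:

```python
# Submission returns immediately; the job runs remotely on the cluster.
run = experiment.submit(src)
print(run)
```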
Monitor your run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run.
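The widget lives in the azureml-widgets package:

```python
from azureml.widgets import RunDetails

# Renders a live-updating view of the run inside the notebook.
RunDetails(run).show()
```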
Alternatively, you can block until the script has completed training before running more code.
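For example:

```python
# Blocks until the run finishes, streaming the driver log to the notebook.
run.wait_for_completion(show_output=True)
```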