Aml Pipelines Getting Started
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Azure Machine Learning Pipelines: Getting Started
Overview
A common scenario when using machine learning components is to have a data workflow that includes the following steps:
- Preparing/preprocessing a given dataset for training, followed by
- Training a machine learning model on this data, and then
- Deploying this trained model in a separate environment, and finally
- Running a batch scoring task on another data set, using the trained model.
Azure's Machine Learning pipelines give you a way to combine multiple steps like these into one configurable workflow, so that multiple agents/users can share and/or reuse this workflow. Machine learning pipelines thus provide a consistent, reproducible mechanism for building, evaluating, deploying, and running ML systems.
To get more information about Azure machine learning pipelines, please read our Azure Machine Learning Pipelines overview, or the readme article.
This notebook provides a gentle introduction to Azure Machine Learning pipelines. We build a pipeline that runs jobs unattended on different compute clusters, and you'll see how to use the basic Azure ML SDK APIs for constructing it.
Prerequisites and Azure Machine Learning Basics
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration notebook first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc.
Azure Machine Learning Imports
In this first code cell, we import key Azure Machine Learning modules that we will use below.
Pipeline-specific SDK imports
Here, we import key pipeline modules, whose use will be illustrated in the examples below.
Initialize Workspace
Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration.
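A minimal sketch of this step, assuming the Azure ML SDK v1 (`azureml-core`) is installed and a `config.json` can be found by `Workspace.from_config()`. The helper names `load_workspace` and `describe_workspace` are illustrative and not part of the SDK; the SDK import sits inside the function so the helpers can be defined even where the SDK is not installed.

```python
def load_workspace():
    """Load the workspace described by the persisted config file."""
    # Imported lazily so this cell can be defined without the SDK installed.
    from azureml.core import Workspace

    return Workspace.from_config()


def describe_workspace(ws):
    """Return a one-line-per-field summary of a workspace-like object."""
    fields = ("name", "resource_group", "location", "subscription_id")
    return "\n".join(f"{field}: {getattr(ws, field)}" for field in fields)


# Typical notebook usage (requires the SDK and a valid config file):
# ws = load_workspace()
# print(describe_workspace(ws))
```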
Required data and script files for the tutorial
Sample files required to finish this tutorial have already been copied to the corresponding source_directory locations. The .py files provided in the samples contain little actual "ML work"; as a data scientist, you will work on such scripts extensively as part of your job. To complete this tutorial, the contents of these files are not very important. The one-line files are for demonstration purposes only.
Datastore concepts
A Datastore is a place where data can be stored that is then made accessible to a compute either by means of mounting or copying the data to the compute target.
A Datastore can either be backed by an Azure File Storage (default) or by an Azure Blob Storage.
In this next step, we will upload the training and test set into the workspace's default storage (File storage), and another piece of data to Azure Blob Storage. When to use Azure Blobs, Azure Files, or Azure Disks is detailed here.
Please take good note of the concept of the datastore.
Upload data to default datastore
The default datastore on the workspace is the Azure File storage. The workspace has a Blob storage associated with it as well. Let's upload a file to each of these storages.
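The upload can be sketched as follows, assuming the SDK v1 datastore method `upload_files(files, target_path=..., overwrite=...)`; newer SDK releases may flag this method as deprecated. The helper name `upload_to_datastore`, the file path, and the target folder are hypothetical placeholders, and `ws` is assumed to be an initialized workspace.

```python
def upload_to_datastore(datastore, files, target_path):
    """Upload local files into `target_path` on the datastore and return it."""
    # overwrite=True replaces files of the same name left by earlier runs.
    datastore.upload_files(files, target_path=target_path, overwrite=True)
    return datastore


# Typical notebook usage (requires the SDK and an initialized workspace `ws`):
# from azureml.core import Datastore
# default_store = ws.get_default_datastore()
# blob_store = Datastore.get(ws, "workspaceblobstore")
# upload_to_datastore(default_store, ["./data/sample.csv"], "sample_data")  # hypothetical file
# upload_to_datastore(blob_store, ["./data/sample.csv"], "sample_data")
```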
(Optional) See your files using Azure Portal
Once you have successfully uploaded the files, you can browse to them (or upload more files) using Azure Portal. At the portal, make sure you have selected your subscription (click Resource Groups and then select the subscription). Then look for your Machine Learning Workspace name. It has a link to your storage. Click on the storage link. It will take you to a page where you can see Blobs, Files, Tables, and Queues. We uploaded one file to the Blob storage and one to the File storage in the step above. You should be able to see both of these files in their respective locations.
Compute Targets
A compute target specifies where to execute your program such as a remote Docker on a VM, or a cluster. A compute target needs to be addressable and accessible by you.
You need at least one compute target to send your payload to. We plan to use Azure Machine Learning Compute exclusively for all steps in this tutorial. However, in some cases you may require multiple compute targets: some steps may run on one compute target, such as Azure Machine Learning Compute, while other steps in the same pipeline run on a different compute target.
The examples below show creating, retrieving, and attaching to an Azure Machine Learning Compute instance.
List of Compute Targets on the workspace
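A short sketch of listing the compute targets, assuming `ws` is an initialized workspace whose `compute_targets` property maps names to compute target objects with a `type` attribute. The helper name `list_compute_targets` is illustrative.

```python
def list_compute_targets(ws):
    """Return 'name: type' strings for every compute target in the workspace."""
    return [f"{name}: {target.type}" for name, target in ws.compute_targets.items()]


# Typical notebook usage:
# for line in list_compute_targets(ws):
#     print(line)
```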
Retrieve or create an Azure Machine Learning compute
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
If we could not find the compute with the given name in the previous cell, then we will create a new compute here. We will create an Azure Machine Learning Compute containing STANDARD_D2_V2 CPU VMs. This process is broken down into the following steps:
- Create the configuration
- Create the Azure Machine Learning compute
This process takes about 3 minutes and provides only sparse output along the way. Please make sure to wait until the call returns before moving to the next cell.
Wait for this call to finish before proceeding (you will see the asterisk turning to a number).
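The retrieve-or-create logic described above might look like the sketch below. It assumes the SDK v1 classes `AmlCompute` and `ComputeTarget`; the helper name `get_or_create_compute` and the `min_nodes=0` / `max_nodes=4` values are assumptions chosen for illustration, while the VM size comes from the text.

```python
def get_or_create_compute(ws, name, vm_size="STANDARD_D2_V2", max_nodes=4):
    """Return the AmlCompute target called `name`, creating it if it is missing."""
    existing = ws.compute_targets.get(name)
    if existing is not None:
        if existing.type != "AmlCompute":
            raise ValueError(f"'{name}' exists but is of type {existing.type}")
        return existing

    # Imported lazily: only needed when a new cluster must be provisioned.
    from azureml.core.compute import AmlCompute, ComputeTarget

    # Step 1: create the configuration (node counts are assumed values).
    config = AmlCompute.provisioning_configuration(
        vm_size=vm_size, min_nodes=0, max_nodes=max_nodes
    )
    # Step 2: create the compute and block until provisioning finishes.
    target = ComputeTarget.create(ws, name, config)
    target.wait_for_completion(show_output=True)
    return target


# Typical notebook usage:
# aml_compute = get_or_create_compute(ws, "amlcompute")
```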
Now that you have created the compute target, let's see what the workspace's compute_targets property returns. You should now see one entry named 'amlcompute' of type AmlCompute.
Now that we have completed learning the basics of Azure Machine Learning (AML), let's go ahead and start understanding the Pipeline concepts.
Creating a Step in a Pipeline
A Step is a unit of execution. A step typically needs a target of execution (compute target) and a script to execute; it may require script arguments and inputs, and it can produce outputs. A step can also take a number of other parameters. Azure Machine Learning Pipelines provides the following built-in Steps:
- PythonScriptStep: Adds a step to run a Python script in a Pipeline.
- AdlaStep: Adds a step to run U-SQL script using Azure Data Lake Analytics.
- DataTransferStep: Transfers data between Azure Blob and Data Lake accounts.
- DatabricksStep: Adds a Databricks notebook as a step in a Pipeline.
- HyperDriveStep: Creates a HyperDrive step for hyperparameter tuning in a Pipeline.
- AzureBatchStep: Creates a step for submitting jobs to Azure Batch.
- EstimatorStep: Adds a step to run an Estimator in a Pipeline.
- MpiStep: Adds a step to run an MPI job in a Pipeline.
- AutoMLStep: Creates an AutoML step in a Pipeline.
The following code will create a PythonScriptStep to be executed in the Azure Machine Learning Compute we created above using train.py, one of the files already made available in the source_directory.
A PythonScriptStep is a basic, built-in step to run a Python script on a compute target. It takes a script name and, optionally, other parameters such as arguments for the script, the compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used. You can also use a RunConfiguration to specify requirements for the PythonScriptStep, such as conda dependencies and a Docker image.
The best practice is to use a separate folder for each step's script and its dependent files, and to specify that folder as the source_directory for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes to any file in the source_directory trigger a re-upload of the snapshot, this also preserves reuse of the step when nothing in its source_directory has changed.
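A minimal sketch of creating such a step, assuming the SDK v1 `PythonScriptStep` class. The helper names, the step name `train_step`, and the `./train` folder are hypothetical; `train.py` is the script named in the text, and `aml_compute` stands for the compute target created earlier.

```python
def train_step_settings(compute_target, source_directory):
    """Collect the keyword arguments used to build the training step."""
    return {
        "name": "train_step",                  # hypothetical display name
        "script_name": "train.py",             # script inside source_directory
        "compute_target": compute_target,
        "source_directory": source_directory,  # one folder per step keeps snapshots small
        "allow_reuse": True,                   # the SDK default, spelled out for clarity
    }


def make_train_step(compute_target, source_directory):
    """Build a PythonScriptStep from the settings above."""
    # Imported lazily so this cell can be defined without the SDK installed.
    from azureml.pipeline.steps import PythonScriptStep

    return PythonScriptStep(**train_step_settings(compute_target, source_directory))


# Typical notebook usage:
# step1 = make_train_step(aml_compute, "./train")  # hypothetical folder
```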
Note: In the above call to PythonScriptStep(), the flag allow_reuse determines whether the step should reuse previous results when run with the same settings/inputs. This flag's default value is True; the default is set to True because, when inputs and parameters have not changed, we typically do not want to re-run a given pipeline step.
If allow_reuse is set to False, a new run will always be generated for this step during pipeline execution. The allow_reuse flag can come in handy in situations where you do not want to re-run a pipeline step.
Running a few steps in parallel
Here we are looking at a simple scenario where we are running a few steps (all involving PythonScriptStep) in parallel. Running nodes in parallel is the default behavior for steps in a pipeline.
We already have one step defined earlier. Let's define a few more steps. For step3, we use customized conda dependencies; the job might fail if "azureml-defaults" (or another meta package) is not in the pip package list. Be aware of whether you are using any of the meta packages (azureml-sdk, azureml-defaults, azureml-core); if not, we recommend installing "azureml-defaults".
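One way to guard against the missing meta package is sketched below, assuming the SDK v1 `RunConfiguration` and `CondaDependencies` classes. The helpers `ensure_meta_package` and `make_run_config` are illustrative names, not SDK functions, and the package-name parsing is deliberately simple.

```python
import re

META_PACKAGES = ("azureml-sdk", "azureml-defaults", "azureml-core")


def ensure_meta_package(pip_packages):
    """Return a copy of pip_packages, adding azureml-defaults if no meta package is listed."""
    packages = list(pip_packages)
    # Strip extras and version specifiers, e.g. "azureml-sdk[automl]==1.0" -> "azureml-sdk".
    names = {re.split(r"[\[<>=!~ ]", p, maxsplit=1)[0].lower() for p in packages}
    if not names.intersection(META_PACKAGES):
        packages.append("azureml-defaults")
    return packages


def make_run_config(pip_packages):
    """Build a RunConfiguration whose conda environment installs the given pip packages."""
    # Imported lazily so this cell can be defined without the SDK installed.
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import RunConfiguration

    run_config = RunConfiguration()
    run_config.environment.python.conda_dependencies = CondaDependencies.create(
        pip_packages=ensure_meta_package(pip_packages)
    )
    return run_config


# Typical notebook usage: pass the result as runconfig=... to PythonScriptStep.
# run_config = make_run_config(["scikit-learn"])
```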
Build the pipeline
Once we have the steps (or steps collection), we can build the pipeline. By default, all these steps will run in parallel once we submit the pipeline for a run.
A pipeline is created with a list of steps and a workspace. Submit a pipeline using submit. When submit is called, a PipelineRun is created which in turn creates StepRun objects for each step in the workflow.
Validate the pipeline
You have the option to validate the pipeline prior to submitting it for a run. The platform runs validation steps, such as checking for circular dependencies and checking parameters, even if you do not explicitly call the validate method.
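Building and validating can be sketched as below, assuming the SDK v1 `Pipeline` class and assuming that `validate()` returns a collection of errors that is empty when the graph is sound. The helper names are illustrative, and `step1`, `step2`, `step3` stand for the steps defined earlier.

```python
def build_pipeline(ws, steps):
    """Create a pipeline from a workspace and a list of steps."""
    # Imported lazily so this cell can be defined without the SDK installed.
    from azureml.pipeline.core import Pipeline

    return Pipeline(workspace=ws, steps=steps)


def report_validation(pipeline):
    """Run validation explicitly and return the list of reported errors."""
    errors = list(pipeline.validate() or [])
    if errors:
        for error in errors:
            print("Validation error:", error)
    else:
        print("Pipeline validation complete")
    return errors


# Typical notebook usage:
# pipeline1 = build_pipeline(ws, [step1, step2, step3])
# report_validation(pipeline1)
```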
Submit the pipeline
Submitting the pipeline involves creating an Experiment object and providing the built pipeline for submission.
Note: If regenerate_outputs is set to True, a new submit will always force generation of all step outputs, and disallow data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run.
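A small sketch of the submission, assuming an SDK v1 `Experiment` object whose `submit` method accepts the pipeline and the `regenerate_outputs` keyword. The helper name `submit_pipeline` and the experiment name are hypothetical.

```python
def submit_pipeline(experiment, pipeline, regenerate_outputs=False):
    """Submit the pipeline to the experiment and return the resulting run."""
    # regenerate_outputs=True forces every step to run again for this submission.
    return experiment.submit(pipeline, regenerate_outputs=regenerate_outputs)


# Typical notebook usage (requires the SDK, a workspace `ws`, and a built pipeline):
# from azureml.core import Experiment
# pipeline_run1 = submit_pipeline(Experiment(ws, "Hello_World1"), pipeline1)  # hypothetical name
```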
Examine the pipeline run
Use RunDetails Widget
We are going to use the RunDetails widget to examine the run of the pipeline. You can click each row below to get more details on the step runs.
Use Pipeline SDK objects
You can cycle through the node_run objects and examine job logs, stdout, and stderr of each of the steps.
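The loop over step runs might look like this sketch, assuming `pipeline_run` is an SDK v1 `PipelineRun` whose `get_steps()` yields step runs exposing `name`, `get_status()`, and `get_job_log()`. The helper name `summarize_step_runs` and the choice to print logs only for failed steps are illustrative.

```python
def summarize_step_runs(pipeline_run, show_logs_for=("Failed",)):
    """Return {step name: status}, printing the job log of steps in selected states."""
    summary = {}
    for step_run in pipeline_run.get_steps():
        status = step_run.get_status()
        summary[step_run.name] = status
        # Widen show_logs_for to inspect logs of successful steps as well.
        if status in show_logs_for:
            print(f"--- job log for {step_run.name} ---")
            print(step_run.get_job_log())
    return summary


# Typical notebook usage:
# print(summarize_step_runs(pipeline_run1))
```

The stdout and stderr of a step can be fetched the same way with `get_stdout_log()` and `get_stderr_log()` on each step run.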
Get additional run details
If you wait until the pipeline_run is finished, you may be able to get additional details on the run. Since this is a blocking call, the following code is commented out.
Running a few steps in sequence
Now let's see how we run a few steps in sequence. We already have three steps defined earlier. Let's reuse those steps for this part.
We will reuse step1, step2, step3, but build the pipeline in such a way that we chain step3 after step2 and step2 after step1. Note that there is no explicit data dependency between these steps, but still steps can be made dependent by using the run_after construct.
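The chaining can be sketched as follows, assuming each step object offers the `run_after` method mentioned above. The helper name `chain_in_sequence` is illustrative; `step1`, `step2`, `step3`, and `ws` stand for the objects defined earlier.

```python
def chain_in_sequence(steps):
    """Make each step run after the one before it and return the same list."""
    for previous, current in zip(steps, steps[1:]):
        current.run_after(previous)
    return steps


# Typical notebook usage (requires the SDK, a workspace `ws`, and the three steps):
# from azureml.pipeline.core import Pipeline
# pipeline2 = Pipeline(workspace=ws, steps=chain_in_sequence([step1, step2, step3]))
```

Note that `run_after` modifies the step objects themselves, so the ordering stays attached to them if they are reused in another pipeline.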
Next: Pipelines with data dependency
The next notebook demonstrates how to construct a pipeline with data dependency.