
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.


Azure Machine Learning Pipelines with Data Dependency

In this notebook, we will see how we can build a pipeline with implicit data dependency.

Prerequisites and Azure Machine Learning Basics

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc.

Azure Machine Learning and Pipeline SDK-specific Imports

[ ]
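The code cell here is empty in this copy; a typical set of imports for this notebook, assuming the v1 `azureml-core` and `azureml-pipeline` packages are installed, might look like the following sketch:

```python
# Core Azure ML imports (azureml-core package)
import azureml.core
from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import AmlCompute, ComputeTarget

# Pipeline-specific imports (azureml-pipeline package)
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Check which SDK version you are running against
print("Azure ML SDK Version:", azureml.core.VERSION)
```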

Initialize Workspace

Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace) object from persisted configuration.

[ ]
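A minimal sketch of the workspace initialization, assuming the `config.json` written by the configuration notebook is present:

```python
from azureml.core import Workspace

# Loads workspace details from the config.json created by the configuration notebook
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")

# The workspace's default blob datastore is used later for data references
# and intermediate outputs
def_blob_store = ws.get_default_datastore()
```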

Source Directory

The best practice is to use a separate folder for each step's script and its dependent files, and to specify that folder as the step's source_directory. This reduces the size of the snapshot created for the step (only the specified folder is snapshotted). Because any change to a file in the source_directory triggers a re-upload of the snapshot, keeping each step's folder small also preserves step reuse across runs in which that folder's contents have not changed.

[ ]
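A sketch of what this cell might contain; the folder name is an assumption, chosen to hold the sample scripts (`train.py`, `extract.py`, `compare.py`) referenced later:

```python
import os

# Hypothetical per-step source folder; the sample scripts used below
# are assumed to live here.
source_directory = "./data_dependency_run"
os.makedirs(source_directory, exist_ok=True)
print("Source directory:", source_directory)
```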

Required data and script files for the tutorial

Sample files required to finish this tutorial have already been copied to the project folder specified above. Although the .py files provided in the samples don't contain much "ML work," in practice this is where most of your effort as a data scientist would go. For the purposes of this tutorial, the contents of these files are not important; the one-line files are for demonstration purposes only.

Compute Targets

See the list of Compute Targets on the workspace.

[ ]
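Listing the existing compute targets could be sketched as follows, using the workspace's `compute_targets` property:

```python
# Enumerate the compute targets already attached to the workspace
for name, target in ws.compute_targets.items():
    print(name, target.type, target.provisioning_state)
```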

Retrieve or create an Aml compute

Azure Machine Learning Compute (AmlCompute) is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's get the default AmlCompute target in the current workspace. We will then run the training script on this compute target.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

[ ]
[ ]
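A get-or-create pattern for the compute target could look like the sketch below; the VM size and node counts are assumptions, not requirements of the tutorial:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

aml_compute_name = "amlcompute"  # name referenced later in the notebook

try:
    # Reuse the compute target if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_compute_name)
    print("Found existing compute target:", aml_compute_name)
except ComputeTargetException:
    # Otherwise provision a small CPU cluster (VM size and node counts
    # are assumptions; adjust to your quota)
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2", min_nodes=0, max_nodes=4
    )
    aml_compute = ComputeTarget.create(ws, aml_compute_name, config)
    aml_compute.wait_for_completion(show_output=True)
```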

Wait for this call to finish before proceeding (you will see the asterisk turning to a number).

Now that you have created the compute target, let's inspect the workspace's compute_targets property. You should now see one entry named 'amlcompute' of type AmlCompute.

Building Pipeline Steps with Inputs and Outputs

As mentioned earlier, a step in the pipeline can take data as input. This data can be a data source that lives in one of the accessible data locations, or intermediate data produced by a previous step in the pipeline.

Datasources

A datasource is represented by a DataReference object and points to data that lives in, or is accessible from, a Datastore. A DataReference can point to a file or a directory.

[ ]
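A hedged sketch of constructing a DataReference; the path on the datastore is a placeholder you would replace with the location of your own data:

```python
from azureml.data.data_reference import DataReference

# Points at data already present on the default blob datastore.
# path_on_datastore is a placeholder, not a path from the tutorial.
blob_input_data = DataReference(
    datastore=def_blob_store,
    data_reference_name="test_data",
    path_on_datastore="<path/on/datastore>",
)
```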

Intermediate/Output Data

Intermediate data (or output of a Step) is represented by PipelineData object. PipelineData can be produced by one step and consumed in another step by providing the PipelineData object as an output of one step and the input of one or more steps.

Constructing PipelineData

  • name: [Required] Name of the data item within the pipeline graph
  • datastore_name: Name of the Datastore to write this output to
  • output_name: Name of the output
  • output_mode: Specifies "upload" or "mount" modes for producing output (default: mount)
  • output_path_on_compute: For "upload" mode, the path to which the module writes this output during execution
  • output_overwrite: Flag to overwrite pre-existing data
[ ]
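A minimal construction might look like this; note that in recent v1 SDK releases the target datastore is passed as a Datastore object via the `datastore` parameter:

```python
from azureml.pipeline.core import PipelineData

# Intermediate data produced by the first step and consumed downstream;
# written to the workspace's default blob datastore
processed_data1 = PipelineData("processed_data1", datastore=def_blob_store)
```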

Pipelines steps using datasources and intermediate data

Machine learning pipelines can have many steps and these steps could use or reuse datasources and intermediate data. Here's how we construct such a pipeline:

Define a Step that consumes a datasource and produces intermediate data.

In this step, we define a step that consumes a datasource and produces intermediate data.

Open train.py in the local machine and examine the arguments, inputs, and outputs for the script. That will give you a good sense of why the script argument names used below are important.

Specify conda dependencies and a base docker image through a RunConfiguration

This step uses a Docker image and scikit-learn. Use a RunConfiguration to specify these requirements, and pass it when creating the PythonScriptStep.

[ ]
[ ]
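The two cells above might be sketched as follows; the step name and argument names are assumptions that would have to match what `train.py` actually parses:

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.steps import PythonScriptStep

# Run configuration with scikit-learn as a conda dependency; the default
# Azure ML base Docker image is used unless another is set explicitly
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["scikit-learn"]
)

# Argument names must match those train.py expects (an assumption here)
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--input_data", blob_input_data, "--output_train", processed_data1],
    inputs=[blob_input_data],
    outputs=[processed_data1],
    compute_target=aml_compute,
    source_directory=source_directory,
    runconfig=run_config,
)
```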

Define a Step that consumes intermediate data and produces intermediate data

In this step, we define a step that consumes intermediate data and produces further intermediate data.

Open extract.py in the local machine and examine the arguments, inputs, and outputs for the script. That will give you a good sense of why the script argument names used below are important.

[ ]
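This step could be sketched like so, wiring the previous step's output in as an input; again, names are assumptions that must match `extract.py`:

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Output of this step, consumed by the compare step later
processed_data2 = PipelineData("processed_data2", datastore=def_blob_store)

extract_step = PythonScriptStep(
    name="extract",
    script_name="extract.py",
    arguments=["--input_extract", processed_data1,
               "--output_extract", processed_data2],
    inputs=[processed_data1],
    outputs=[processed_data2],
    compute_target=aml_compute,
    source_directory=source_directory,
)
```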

Define a Step that consumes intermediate data and existing data and produces intermediate data

In this step, we define a step that consumes multiple data types and produces intermediate data.

This step uses the output generated from the previous step as well as existing data on a DataStore. The location of the existing data is specified using a PipelineParameter and a DataPath. Using a PipelineParameter enables easy modification of the data location when the Pipeline is published and resubmitted.

Open compare.py in the local machine and examine the arguments, inputs, and outputs for the script. That will give you a good sense of why the script argument names used below are important.

[ ]
[ ]
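A sketch of the two cells above, combining a PipelineParameter with a DataPath so the existing data's location can be overridden at submission time (the datastore path is a placeholder):

```python
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Existing data on the datastore, exposed as a pipeline parameter;
# path_on_datastore is a placeholder for your own data
datapath = DataPath(datastore=def_blob_store,
                    path_on_datastore="<existing/data/path>")
datapath_param = PipelineParameter(name="compare_data", default_value=datapath)
data_path_input = (datapath_param, DataPathComputeBinding(mode="mount"))

processed_data3 = PipelineData("processed_data3", datastore=def_blob_store)

compare_step = PythonScriptStep(
    name="compare",
    script_name="compare.py",
    arguments=["--compare_data1", data_path_input,
               "--compare_data2", processed_data2,
               "--output_compare", processed_data3],
    inputs=[data_path_input, processed_data2],
    outputs=[processed_data3],
    compute_target=aml_compute,
    source_directory=source_directory,
)
```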

Build the pipeline

[ ]
[ ]
[ ]
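Building and submitting the pipeline might look like the sketch below; only the final step needs to be listed, since upstream steps are pulled in through the data dependencies established above (the experiment name is an assumption):

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# The compare step's inputs transitively pull in the train and extract steps
pipeline1 = Pipeline(workspace=ws, steps=[compare_step])
pipeline1.validate()

pipeline_run = Experiment(ws, "data_dependency_demo").submit(pipeline1)
```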

Wait for pipeline run to complete

[ ]
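Waiting on the run is a single call:

```python
# Blocks until every step has finished, streaming status to the cell output
pipeline_run.wait_for_completion(show_output=True)
```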

See Outputs

See where outputs of each pipeline step are located on your datastore.

Wait for the pipeline run to complete to make sure all the outputs are ready.

[ ]
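Inspecting each step's outputs could be sketched as follows, walking the step runs and resolving each named output to its datastore location:

```python
# Print the datastore location of every named output of every step
for step in pipeline_run.get_steps():
    print("Outputs of step", step.name)
    for name, output in step.get_outputs().items():
        ref = output.get_port_data_reference()
        print("  name:", name)
        print("  datastore:", ref.datastore_name)
        print("  path on datastore:", ref.path_on_datastore)
```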

Download Outputs

We can download the output of any step to our local machine using the SDK.

[ ]
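A sketch of downloading one step's output, assuming the step and output names used in the sketches above:

```python
# Locate the train step's run and download its "processed_data1" output
# to the current directory (names assume the step definitions above)
train_step_run = pipeline_run.find_step_run("train")[0]
port_data = train_step_run.get_output_data("processed_data1")
port_data.download(local_path=".", show_progress=True)
```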

Next: Publishing the Pipeline and calling it from the REST endpoint

See this notebook to learn how to publish the pipeline and call its REST endpoint to run it.