
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.

AML Pipeline with AdlaStep

This notebook demonstrates the use of AdlaStep in AML Pipelines. AdlaStep runs U-SQL scripts using the Azure Data Lake Analytics (ADLA) service.

AML and Pipeline SDK-specific imports

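A sketch of the imports this notebook typically needs, assuming Azure ML SDK v1 (`azureml-sdk` with the pipeline extras installed):

```python
# Requires the azureml-sdk package; these are the SDK v1 pipeline imports.
import azureml.core
from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import ComputeTarget, AdlaCompute
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import AdlaStep

print("Azure ML SDK version:", azureml.core.VERSION)
```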

Initialize Workspace

Initialize a workspace object from persisted configuration. If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration notebook first if you haven't already.

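A minimal sketch of the initialization cell, assuming a `config.json` has been written by the configuration notebook:

```python
from azureml.core import Workspace

# Loads workspace details from the persisted config.json.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
```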

Attach ADLA account to workspace

To submit jobs to the Azure Data Lake Analytics service, you must first attach your ADLA account to the workspace. You'll need to provide the account name and resource group of the ADLA account to complete this step.

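A sketch of attaching the ADLA account as a compute target; the compute name and the `<...>` placeholders are values you supply, not real defaults:

```python
from azureml.core.compute import ComputeTarget, AdlaCompute

adla_compute_name = "testadl"          # name to register the compute under in AML
account_name = "<adla_account_name>"   # your ADLA account name
resource_group = "<resource_group>"    # resource group of the ADLA account

if adla_compute_name in ws.compute_targets:
    adla_compute = ws.compute_targets[adla_compute_name]
else:
    # Attach the existing ADLA account to the workspace.
    attach_config = AdlaCompute.attach_configuration(
        resource_group=resource_group, account_name=account_name)
    adla_compute = ComputeTarget.attach(ws, adla_compute_name, attach_config)
    adla_compute.wait_for_completion(show_output=True)
```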

Register Data Lake Storage as Datastore

To register Data Lake Storage as a datastore in the workspace, you'll need account information such as the account name, resource group, and subscription ID.

AdlaStep can only work with data stored in the default Data Lake Storage of the Data Lake Analytics account provided above. If the data you need to work with is in a non-default storage, you can use a DataTransferStep to copy the data before training. You can find the default storage by opening your Data Lake Analytics account in the Azure portal and navigating to the 'Data sources' item under Settings in the left pane.

Grant Azure AD application access to Data Lake Storage

You'll also need an Azure Active Directory (AAD) application that can access Data Lake Storage. This document contains step-by-step instructions on how to create an AAD application and assign it access to Data Lake Storage. A couple of important notes when assigning permissions to the AAD app:

  • Access should be provided at root folder level.
  • In the 'Assign permissions' pane, select Read, Write, and Execute permissions for 'This folder and all children'. Add as 'An access permission entry and a default permission entry' to make sure the application can access any new files created in the future.
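A sketch of the datastore registration, assuming the AAD application from above; every `<...>` value and the datastore name are placeholders for your own details:

```python
from azureml.core import Datastore

# Register the default ADLS of the ADLA account as a workspace datastore.
datastore = Datastore.register_azure_data_lake(
    workspace=ws,
    datastore_name="adls_datastore",
    store_name="<adls_account_name>",
    tenant_id="<tenant_id>",             # AAD tenant (directory) ID
    client_id="<aad_client_id>",         # AAD application (client) ID
    client_secret="<aad_client_secret>", # AAD application secret
    resource_group="<resource_group>",
    subscription_id="<subscription_id>")
```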

Setup inputs and outputs

For the purposes of this demo, we're going to execute a simple U-SQL script that reads a CSV file and writes a portion of its content to a new text file. First, let's create our sample input, which contains three columns: employee ID, name, and department ID.

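A small sketch of creating the sample input locally; the column names and rows are illustrative values, not from the original:

```python
import csv

# Tiny sample dataset: employee ID, name, department ID.
sample_rows = [
    ["EmpId", "Name", "DeptId"],
    ["1", "Alice", "10"],
    ["2", "Bob", "20"],
    ["3", "Carol", "10"],
]
with open("sample_input.csv", "w", newline="") as f:
    csv.writer(f).writerows(sample_rows)
```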

Upload this file to Data Lake Storage at location adla_sample/sample_input.csv and create a DataReference to refer to this file.

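A sketch of the DataReference, assuming the datastore registered earlier; the reference name matches the `@@employee_data@@` placeholder used later:

```python
from azureml.data.data_reference import DataReference

# Points at the file uploaded to Data Lake Storage.
sample_input = DataReference(
    datastore=datastore,
    data_reference_name="employee_data",
    path_on_datastore="adla_sample/sample_input.csv")
```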

Create a PipelineData object to store the output produced by AdlaStep.

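A one-line sketch of the intermediate output; the name matches the `@@sample_output@@` placeholder used in the script:

```python
from azureml.pipeline.core import PipelineData

# Intermediate output written back to the same Data Lake datastore.
sample_output = PipelineData("sample_output", datastore=datastore)
```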

Write your U-SQL script

Now let's write a U-SQL script that reads the above CSV file and writes the name column to a new file.

Instead of hard-coding paths in your script, you can use @@name@@ syntax to refer to inputs, outputs, and parameters.

  • If name is the name of an input or output port binding, any occurrences of @@name@@ in the script are replaced with the actual data path of the corresponding port binding.
  • If name matches a key in the params dictionary, any occurrences of @@name@@ are replaced with the corresponding value in the dictionary.

Note the use of @@ syntax in the below script. Before submitting the job to the Data Lake Analytics service, @@employee_data@@ will be replaced with the actual path of sample_input.csv in Data Lake Storage. Similarly, @@sample_output@@ will be replaced with a path in Data Lake Storage used to store the intermediate output produced by the step.

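A sketch of writing the script to a source directory, plus a pure-Python illustration of how the @@name@@ substitution behaves (the folder, file name, and substituted paths are assumptions for this demo; the service computes the real paths):

```python
import os

source_directory = "adla_sample"  # hypothetical folder for this step's files
os.makedirs(source_directory, exist_ok=True)

# Illustrative U-SQL: extract the CSV and output only the Name column.
usql = """\
@employees =
    EXTRACT EmpId int, Name string, DeptId int
    FROM "@@employee_data@@"
    USING Extractors.Csv(skipFirstNRows: 1);

@names =
    SELECT Name FROM @employees;

OUTPUT @names
    TO "@@sample_output@@"
    USING Outputters.Text();
"""
with open(os.path.join(source_directory, "sample_script.usql"), "w") as f:
    f.write(usql)

# Illustration only: before submission, each @@name@@ placeholder is
# replaced with the resolved Data Lake Storage path of that binding.
resolved = (usql
            .replace("@@employee_data@@", "adla_sample/sample_input.csv")
            .replace("@@sample_output@@", "adla_sample/output/names.txt"))
```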

Create an AdlaStep

AdlaStep is used to run a U-SQL script using Azure Data Lake Analytics.

  • name: name of the step
  • script_name: name of the U-SQL script file
  • inputs: list of input port bindings
  • outputs: list of output port bindings
  • compute_target: the ADLA compute to use for this job
  • params: dictionary of name-value pairs to pass to the U-SQL job (optional)
  • degree_of_parallelism: the degree of parallelism to use for this job (optional)
  • priority: the priority value to use for the current job (optional)
  • runtime_version: the runtime version of the Data Lake Analytics engine (optional)
  • source_directory: folder that contains the script, assemblies, etc. (optional)
  • hash_paths: list of paths to hash to detect a change (the script file is always hashed) (optional)

A best practice is to keep each step's script and its dependent files in a separate folder and specify that folder as the step's source_directory. This reduces the size of the snapshot created for the step (only the specified folder is snapshotted). Because a change to any file in the source_directory triggers a re-upload of the snapshot, keeping the folder minimal also helps preserve step reuse when nothing in it has changed.

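A sketch of constructing the step from the pieces defined above; the step name and tuning values are illustrative choices:

```python
from azureml.pipeline.steps import AdlaStep

adla_step = AdlaStep(
    name="extract_employee_names",      # hypothetical step name
    script_name="sample_script.usql",
    source_directory="adla_sample",
    inputs=[sample_input],
    outputs=[sample_output],
    compute_target=adla_compute,
    degree_of_parallelism=1,            # illustrative tuning values
    priority=100)
```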

Build and Submit the Experiment

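A sketch of building the pipeline and submitting it as an experiment run; the experiment name is a placeholder:

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[adla_step])
pipeline_run = Experiment(ws, "adla_sample_pipeline").submit(pipeline)
```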

View Run Details

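A sketch of monitoring the run with the notebook widget (requires the azureml-widgets package):

```python
from azureml.widgets import RunDetails

# Renders an interactive view of the pipeline run in the notebook.
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()
```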