AML Pipelines: Data Transfer
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Azure Machine Learning Pipeline with DataTransferStep
This notebook demonstrates the use of DataTransferStep in an Azure Machine Learning Pipeline.
Note: In Azure Machine Learning, you can write output data directly to Azure Blob Storage, Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, or Azure File Share without an extra DataTransferStep. Learn how to use OutputFileDatasetConfig to achieve that with the sample notebooks here.
In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Azure SQL Database and you may want to move it to Azure Data Lake Storage. Or your data may be in an ADLS account and you may want to make it available in Blob storage. The built-in DataTransferStep class helps you transfer data in these situations.
The examples below show how to move data between different storage types supported in Azure Machine Learning.
Data transfer currently supports the following storage types:
| Data store | Supported as a source | Supported as a sink |
|---|---|---|
| Azure Blob Storage | Yes | Yes |
| Azure Data Lake Storage Gen 2 | Yes | Yes |
| Azure SQL Database | Yes | Yes |
| Azure Database for PostgreSQL | Yes | Yes |
| Azure Database for MySQL | Yes | Yes |
Azure Machine Learning and Pipeline SDK-specific imports
Initialize Workspace
Initialize a workspace object from persisted configuration. If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure the config file is present at .\config.json.
If you don't have a config.json file, please go through the configuration notebook first.
This sets you up with a working config file that has information on your workspace, subscription id, etc.
Register Datastores and create DataReferences
For background on registering your data store, consult this article:
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data
Please make sure to update the following code examples with appropriate values.
Azure Blob Storage
Since Blob Storage can contain a file and a directory with the same name, you can use the optional source_reference_type and destination_reference_type arguments in the DataTransferStep constructor to explicitly specify whether you're referring to the file or the directory.
Azure Data Lake Storage Gen2
Please consult the following article for detailed steps on setting up service principal authentication and assigning the correct permissions to a Data Lake Storage Gen2 account:
Azure SQL Database
For enabling service principal authentication for an Azure SQL Database, please follow this section in Azure Data Factory documentation: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database#service-principal-authentication
Note: When copying data to an Azure SQL Database, the data will be appended to an existing table. The source file is also expected to have a header row, and the header names should exactly match the column names in the destination table.
Azure Database for PostgreSQL
Azure Database for MySQL
Setup Data Factory Account
Create a DataTransferStep
DataTransferStep is used to transfer data between Azure Blob Storage, Azure Data Lake Store, and Azure SQL Database.
- name: Name of the step.
- source_data_reference: Input connection that serves as the source of the data transfer operation.
- destination_data_reference: Input connection that serves as the destination of the data transfer operation.
- compute_target: Azure Data Factory to use for transferring data.
- allow_reuse: Whether the step should reuse the results of a previous DataTransferStep when run with the same inputs. Set to False to force data to be transferred again.
Optional arguments explicitly specify whether a path corresponds to a file or a directory. These are useful when the storage contains both a file and a directory with the same name, or when creating a new destination path.
- source_reference_type: An optional string specifying the type of source_data_reference. Possible values: 'file', 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new.
- destination_reference_type: An optional string specifying the type of destination_data_reference. Possible values: 'file', 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new.
Build and Submit the Experiment
View Run Details
Next: Databricks as a Compute Target
To use Databricks as a compute target from an Azure Machine Learning Pipeline, a DatabricksStep is used. The next notebook demonstrates the use of a DatabricksStep in an Azure Machine Learning Pipeline.