AML Pipelines: Data Transfer
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Azure Machine Learning Pipeline with DataTransferStep
This notebook demonstrates the use of DataTransferStep in an Azure Machine Learning Pipeline.
Note: In Azure Machine Learning, you can write output data directly to Azure Blob Storage, Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, or Azure File Share without an extra DataTransferStep. Learn how to use OutputFileDatasetConfig to achieve that with the sample notebooks here.
In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Azure SQL Database and you may want to move it to Azure Data Lake Storage. Or your data may be in an ADLS account and you may want to make it available in Blob storage. The built-in DataTransferStep class helps you transfer data in these situations.
The examples below show how to move data between different storage types supported in Azure Machine Learning.
Data transfer currently supports the following storage types:
| Data store | Supported as a source | Supported as a sink |
|---|---|---|
| Azure Blob Storage | Yes | Yes |
| Azure Data Lake Storage Gen 2 | Yes | Yes |
| Azure SQL Database | Yes | Yes |
| Azure Database for PostgreSQL | Yes | Yes |
| Azure Database for MySQL | Yes | Yes |
Azure Machine Learning and Pipeline SDK-specific imports
Initialize Workspace
Initialize a workspace object from persisted configuration. If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure the config file is present at .\config.json.
If you don't have a config.json file, please go through the configuration notebook first.
This sets you up with a working config file that has information on your workspace, subscription id, etc.
Register Datastores and create DataReferences
For background on registering your data store, consult this article:
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data
Please make sure to update the following code examples with appropriate values.
Azure Blob Storage
Since Blob Storage can contain a file and a directory with the same name, you can use the optional source_reference_type and destination_reference_type arguments in the DataTransferStep constructor to explicitly specify whether you're referring to the file or the directory.
Azure Data Lake Storage Gen2
Please consult the following article for detailed steps on setting up service principal authentication and assigning the correct permissions to a Data Lake Storage Gen2 account:
Azure SQL Database
For enabling service principal authentication for an Azure SQL Database, please follow this section in Azure Data Factory documentation: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database#service-principal-authentication
Note: When copying data to an Azure SQL Database, the data will be appended to an existing table. The source file is also expected to have a header row, and the header names should exactly match the column names in the destination table.
Azure Database for PostgreSQL
Azure Database for MySQL
Setup Data Factory Account
Create a DataTransferStep
DataTransferStep is used to transfer data between Azure Blob Storage, Azure Data Lake Store, and Azure SQL Database.
- name: Name of the step.
- source_data_reference: Input connection that serves as the source of the data transfer operation.
- destination_data_reference: Input connection that serves as the destination of the data transfer operation.
- compute_target: Azure Data Factory to use for transferring data.
- allow_reuse: Whether the step should reuse the results of a previous DataTransferStep when run with the same inputs. Set to False to force data to be transferred again.
Optional arguments explicitly specify whether a path corresponds to a file or a directory. These are useful when the storage contains both a file and a directory with the same name, or when creating a new destination path.
- source_reference_type: An optional string specifying the type of source_data_reference. Possible values: 'file', 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new.
- destination_reference_type: An optional string specifying the type of destination_data_reference. Possible values: 'file', 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new.
Build and Submit the Experiment
View Run Details
Next: Databricks as a Compute Target
To use Databricks as a compute target from an Azure Machine Learning Pipeline, a DatabricksStep is used. The next notebook demonstrates the use of a DatabricksStep in an Azure Machine Learning Pipeline.