
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Train with Azure Machine Learning datasets

Datasets are categorized into TabularDataset and FileDataset based on how users consume them in training.

  • A TabularDataset represents data in a tabular format by parsing the provided file or list of files. A TabularDataset can be created from .csv, .tsv, and .parquet files, from SQL query results, and more; for the complete list, please visit our documentation. It provides you with the ability to materialize the data into a pandas DataFrame.
  • A FileDataset references one or more files in your datastores or at public URLs. It provides you with the ability to download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios, including deep learning.

In this tutorial, you will learn how to train with Azure Machine Learning datasets:

☑ Use datasets directly in your training script

☑ Use datasets to mount files to a remote compute

Prerequisites

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook first if you haven't already established your connection to the AzureML Workspace.

[ ]
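As a quick check (a minimal sketch; the original cell contents are not shown), you can verify the SDK is installed and print its version:

```python
# check that the azureml-core SDK is installed and print its version
import azureml.core

print('SDK version:', azureml.core.VERSION)
```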

Initialize Workspace

Initialize a workspace object from persisted configuration.

[ ]
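A minimal sketch of this cell, assuming a config.json has been written by the configuration notebook:

```python
from azureml.core import Workspace

# loads workspace details from the config.json created by the configuration notebook
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep='\n')
```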

Create Experiment

An Experiment is a logical container in an Azure ML Workspace. It hosts run records, which can include run metrics and output artifacts from your experiments.

[ ]
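A sketch of creating the experiment; the experiment name 'train-with-datasets' is an assumption:

```python
from azureml.core import Experiment

# an Experiment groups the runs submitted in this tutorial
experiment = Experiment(workspace=ws, name='train-with-datasets')
```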

Create or Attach existing compute resource

By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

Creation of compute takes approximately 5 minutes. If an AmlCompute with that name is already in your workspace, the code will skip the creation process.

[ ]
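A sketch of the create-or-attach pattern described above; the cluster name 'cpu-cluster' and the VM size are assumptions:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = 'cpu-cluster'  # hypothetical name; use your own

try:
    # reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    # otherwise provision a new AmlCompute cluster (takes ~5 minutes)
    config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                   max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```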

You now have the necessary packages and compute resources to train a model in the cloud.

Use datasets directly in training

Create a TabularDataset

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

Every workspace comes with a default datastore (and you can register more), which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from local to the cloud and create datasets from it. We will now upload the Iris data to the default datastore (blob) within your workspace.

[ ]
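A sketch of the upload, assuming the Iris data sits in a local train-dataset folder as iris.csv:

```python
# get the default (blob-backed) datastore and upload the local csv
datastore = ws.get_default_datastore()
datastore.upload_files(files=['./train-dataset/iris.csv'],
                       target_path='train-dataset/tabular/',
                       overwrite=True,
                       show_progress=True)
```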

Then we will create an unregistered TabularDataset pointing to the path in the datastore. You can also create a dataset from multiple paths; learn more in the documentation.

TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame. You can create a TabularDataset object from .csv, .tsv, and .parquet files, and from SQL query results. For a complete list, see the TabularDatasetFactory class.

[ ]
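A sketch of creating the unregistered TabularDataset from the path used in the upload above:

```python
from azureml.core import Dataset

# reference the csv uploaded to the default datastore
dataset = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-dataset/tabular/iris.csv')])

# preview the first rows as a pandas DataFrame
dataset.take(3).to_pandas_dataframe()
```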

Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called train_iris.py in the script_folder.

[ ]
[ ]
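A sketch of the two cells: the first creates the script folder, the second writes train_iris.py into it. The folder name, the 'species' label column, and the model choice are assumptions:

```python
import os

# hypothetical local folder that will be uploaded with the run
script_folder = os.path.join(os.getcwd(), 'train-with-datasets')
os.makedirs(script_folder, exist_ok=True)
```

```python
%%writefile $script_folder/train_iris.py
import os

import joblib
from azureml.core import Run
from sklearn.tree import DecisionTreeClassifier

run = Run.get_context()

# the dataset is exposed under the name assigned via as_named_input('iris')
df = run.input_datasets['iris'].to_pandas_dataframe()

# 'species' as the label column is an assumption about the uploaded csv
X = df.drop(columns=['species']).values
y = df['species'].values

clf = DecisionTreeClassifier().fit(X, y)
run.log('accuracy', clf.score(X, y))

# files written to ./outputs are uploaded to the run record automatically
os.makedirs('outputs', exist_ok=True)
joblib.dump(clf, 'outputs/model.joblib')
```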

Create an environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.

[ ]
[ ]
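A sketch of the two cells: write the conda YAML, then build the environment from it. The environment name and the exact dependency list are assumptions:

```python
%%writefile conda_env.yml
name: sklearn-env
dependencies:
  - python=3.8
  - pip:
    - azureml-defaults
    - azureml-dataset-runtime[fuse,pandas]
    - scikit-learn
    - joblib
```

```python
from azureml.core import Environment

# build an Azure ML environment from the conda specification above
sklearn_env = Environment.from_conda_specification(name='sklearn-env',
                                                   file_path='conda_env.yml')
```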

Configure training run

A ScriptRunConfig object specifies the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Specify the following in your script run configuration:

  • The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution.
  • The training script name, train_iris.py.
  • The input dataset for training, passed as an argument to your training script. as_named_input() is required so that the input dataset can be referenced by the assigned name in your training script.
  • The compute target. In this case, you will use the AmlCompute you created.
  • The environment definition for the experiment.
[ ]
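A sketch of the run configuration, wiring together the pieces listed above:

```python
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_iris.py',
                      # as_named_input('iris') makes the dataset available as
                      # run.input_datasets['iris'] inside the training script
                      arguments=[dataset.as_named_input('iris')],
                      compute_target=compute_target,
                      environment=sklearn_env)
```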

Submit job to run

Submit the ScriptRunConfig to the Azure ML experiment to kick off the execution.

[ ]
[ ]
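A sketch of submitting the run and waiting for it to finish:

```python
# kick off the remote run and stream its logs until completion
run = experiment.submit(src)
run.wait_for_completion(show_output=True)
```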

Use datasets to mount files to a remote compute

You can use the Dataset object to mount or download the files it references. When you mount a file system, you attach that file system to a directory (mount point) and make it available to the system. Because mounting loads files only at the time of processing, it is usually faster than downloading.
Note: mounting is only available for Linux-based compute (DSVM/VM, AmlCompute, HDInsight).

Upload data files into datastore

We will first load the diabetes data from scikit-learn and save it to the train-dataset folder.

[ ]
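A sketch of the cell, assuming the files are written to a local train-dataset folder:

```python
import os

import numpy as np
from sklearn.datasets import load_diabetes

os.makedirs('./train-dataset', exist_ok=True)

# save the features and labels as two .npy files
training_data = load_diabetes()
np.save(file='./train-dataset/features.npy', arr=training_data['data'])
np.save(file='./train-dataset/labels.npy', arr=training_data['target'])
```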

Now let's upload the two files into the default datastore under a path named diabetes:

[ ]
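A sketch of the upload to the diabetes path on the default datastore:

```python
datastore = ws.get_default_datastore()
datastore.upload_files(files=['./train-dataset/features.npy',
                              './train-dataset/labels.npy'],
                       target_path='diabetes/',
                       overwrite=True)
```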

Create a FileDataset

A FileDataset references one or more files in your datastores or at public URLs. Using a FileDataset, you can download or mount the files to your compute. The files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

[ ]
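A sketch of creating the FileDataset over the folder uploaded above:

```python
from azureml.core import Dataset

# reference every file under the diabetes/ path on the datastore
dataset = Dataset.File.from_files(path=[(datastore, 'diabetes/')])

# list the files the dataset refers to
dataset.to_path()
```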

Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called train_diabetes.py in the script_folder.

[ ]
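A sketch of the script. The --data-folder argument, the Ridge model, and the logged metric names are assumptions consistent with the configuration below:

```python
%%writefile $script_folder/train_diabetes.py
import argparse
import glob
import os

import joblib
import numpy as np
from azureml.core import Run
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str,
                    help='mount point of the FileDataset on the compute target')
args = parser.parse_args()

# locate the two .npy files anywhere under the mount point
X = np.load(glob.glob(os.path.join(args.data_folder, '**/features.npy'),
                      recursive=True)[0])
y = np.load(glob.glob(os.path.join(args.data_folder, '**/labels.npy'),
                      recursive=True)[0])

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

run = Run.get_context()
alpha = 0.5
model = Ridge(alpha=alpha).fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
run.log('alpha', alpha)
run.log('mse', mse)

# files written to ./outputs are uploaded to the run record automatically
os.makedirs('outputs', exist_ok=True)
joblib.dump(model, 'outputs/model.joblib')
```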

Configure & Run

Now configure your run. We will reuse the same sklearn_env environment from the previous run. Once the environment is built, and if you don't change your dependencies, it will be reused in subsequent runs.

We will pass the DatasetConsumptionConfig of our FileDataset to the '--data-folder' argument of the script. Azure ML will resolve this to the mount point of the data on the compute target, which we parse in the training script.

[ ]
[ ]
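A sketch of the configuration and submission; as_mount() yields the DatasetConsumptionConfig that Azure ML resolves to a mount point:

```python
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_diabetes.py',
                      # the mount point is substituted for as_mount() at runtime
                      arguments=['--data-folder', dataset.as_mount()],
                      compute_target=compute_target,
                      environment=sklearn_env)

run = experiment.submit(src)
run.wait_for_completion(show_output=True)
```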

Display run results

You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, such as the error of the model:

[ ]
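A sketch of retrieving the logged metrics:

```python
# dictionary of every metric logged during the run, e.g. alpha and mse
metrics = run.get_metrics()
print(metrics)
```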

Register datasets

Use the register() method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script.

[ ]
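A sketch of registering the FileDataset; the name and description are assumptions:

```python
# register the dataset so it can be retrieved by name across experiments
dataset = dataset.register(workspace=ws,
                           name='diabetes dataset',
                           description='training dataset for the diabetes model',
                           create_new_version=True)
```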

Register models with datasets

The last step in the training script wrote the model files to a directory named outputs on the VM of the cluster where the job ran. outputs is a special directory in that all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model file is now also available in your workspace.

You can register models with datasets for reproducibility and auditing purposes.

[ ]
[ ]
[ ]
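A sketch of registering the model from the run together with the dataset it was trained on; the model name and path are assumptions:

```python
# register the model and record which dataset it was trained on
model = run.register_model(model_name='sklearn-diabetes-model',
                           model_path='outputs/model.joblib',
                           datasets=[('training data', dataset)])
print(model.name, model.version)
```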