Automl Databricks Local 01

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

AutoML Installation

For Databricks non-ML runtime 7.1 (Scala 2.12, Spark 3.0.0) and up, install the AML SDK by running the following command in the first cell of the notebook.

%pip install --upgrade --force-reinstall -r https://aka.ms/automl_linux_requirements.txt

For Databricks non-ML runtime 7.0 and lower, install the AML SDK using an init script, as shown in the readme, before running this notebook.

AutoML: Classification with Local Compute on Azure Databricks

In this example we use scikit-learn's digits dataset to showcase how you can use AutoML for a simple classification problem.

In this notebook you will learn how to:

  1. Create Azure Machine Learning Workspace object and initialize your notebook directory to easily reload this object from a configuration file.
  2. Create an Experiment in an existing Workspace.
  3. Configure AutoML using AutoMLConfig.
  4. Train the model using Azure Databricks.
  5. Explore the results.
  6. Test the best fitted model.

Prerequisites: Before running this notebook, please follow the readme for installing necessary libraries to your cluster.

Register Machine Learning Services Resource Provider

Microsoft.MachineLearningServices only needs to be registered once in the subscription. To register it:

  1. Open the Azure portal.
  2. Select All services and then Subscriptions.
  3. Select the subscription that you want to use.
  4. Click Resource providers.
  5. Click the Register link next to Microsoft.MachineLearningServices.

Check the Azure ML Core SDK Version to Validate Your Installation

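A minimal sketch of such a check, assuming the SDK exposes its version string as azureml.core.VERSION (as SDK v1 does). The helper name sdk_version is hypothetical, and the import is guarded so that a missing installation is reported instead of raising.

```python
def sdk_version():
    """Return the installed Azure ML Core SDK version, or None if it is missing."""
    try:
        import azureml.core  # installed by the AutoML requirements step
    except ImportError:
        return None
    return azureml.core.VERSION


version = sdk_version()
if version is None:
    print("azureml-core is not installed on this cluster; rerun the installation step.")
else:
    print("Azure ML Core SDK version:", version)
```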

Initialize an Azure ML Workspace

What is an Azure ML Workspace and Why Do I Need One?

An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models.

What do I Need?

To create or access an Azure ML workspace, you will need to import the Azure ML library and specify the following information:

  • A name for your workspace. You can choose one.
  • Your subscription id. Use the id value from the output of the az account show command.
  • The resource group name. The resource group organizes Azure resources and provides a default region for the resources in the group. The resource group will be created if it doesn't exist. Resource groups can be created and viewed in the Azure portal.
  • The workspace region. Supported regions include eastus2, eastus, westcentralus, southeastasia, westeurope, australiaeast, westus2, southcentralus.
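The four values above can be collected in one place before any SDK call is made. This is only a sketch: the helper names and the placeholder values are hypothetical, and because the text says the supported regions "include" this list, the check only reports whether a region is listed rather than rejecting it.

```python
# Regions named in the list above; the list may not be exhaustive.
LISTED_REGIONS = (
    "eastus2", "eastus", "westcentralus", "southeastasia",
    "westeurope", "australiaeast", "westus2", "southcentralus",
)


def is_listed_region(region):
    """Return True if the region appears in the list quoted in this notebook."""
    return region in LISTED_REGIONS


def workspace_settings(subscription_id, resource_group, workspace_name, workspace_region):
    """Bundle the values needed to create or access a workspace."""
    return {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
        "workspace_region": workspace_region,
    }


# Placeholder values (hypothetical): replace them with your own.
settings = workspace_settings(
    subscription_id="<your-subscription-id>",
    resource_group="<your-resource-group>",
    workspace_name="<your-workspace-name>",
    workspace_region="eastus2",
)
if not is_listed_region(settings["workspace_region"]):
    print("Note: region is not in the list above:", settings["workspace_region"])
```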

Creating a Workspace

If you already have access to an Azure ML workspace you want to use, you can skip this cell. Otherwise, this cell will create an Azure ML workspace for you in the specified subscription, provided you have the correct permissions for the given subscription_id.

This will fail when:

  1. The workspace already exists.
  2. You do not have permission to create a workspace in the resource group.
  3. You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription.

If workspace creation fails for any reason other than already existing, please work with your IT administrator to provide you with the appropriate permissions or to provision the required resources.

Note: Creation of a new workspace can take several minutes.

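A sketch of what the creation cell might look like, assuming the SDK v1 call Workspace.create. The helper names and placeholder values are hypothetical. exist_ok=False is chosen here so that the call fails when the workspace already exists, matching the failure cases listed above; the SDK import is deferred until the function is actually called.

```python
def workspace_create_kwargs(settings):
    """Translate the settings dictionary into keyword arguments for Workspace.create."""
    return {
        "name": settings["workspace_name"],
        "subscription_id": settings["subscription_id"],
        "resource_group": settings["resource_group"],
        "location": settings["workspace_region"],
        "create_resource_group": True,  # create the resource group if it is missing
        "exist_ok": False,              # fail if the workspace already exists
    }


def create_workspace(settings):
    """Create the workspace; this provisions Azure resources and can take minutes."""
    from azureml.core import Workspace  # requires azureml-core on the cluster
    return Workspace.create(**workspace_create_kwargs(settings))


# Placeholder values (hypothetical): replace them with your own.
example_settings = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
    "workspace_region": "eastus2",
}
create_kwargs = workspace_create_kwargs(example_settings)
# ws = create_workspace(example_settings)  # uncomment to actually create the workspace
```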

Configuring Your Local Environment

You can validate that you have access to the specified workspace and write a configuration file to the default configuration location. Older SDK releases use ./aml_config/config.json; more recent SDK v1 releases default to ./.azureml/config.json.

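One way to do this, sketched under the assumption that you use the SDK v1 Workspace class: connect to the workspace, read a few of its attributes to confirm access, then call write_config. The function names are hypothetical.

```python
def connect_workspace(subscription_id, resource_group, workspace_name):
    """Connect to an existing workspace; raises if you lack access to it."""
    from azureml.core import Workspace  # requires azureml-core on the cluster
    return Workspace(
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace_name=workspace_name,
    )


def describe_and_save(ws):
    """Summarize the workspace and write config.json to the SDK's default location."""
    summary = {
        "name": ws.name,
        "resource_group": ws.resource_group,
        "location": ws.location,
    }
    ws.write_config()  # later cells can reload it with Workspace.from_config()
    return summary


# ws = connect_workspace("<your-subscription-id>", "<your-resource-group>", "<your-workspace-name>")
# print(describe_and_save(ws))
```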

Create an Experiment

As part of the setup you have already created an Azure ML Workspace object. For AutoML you will need to create an Experiment object, which is a named object in a Workspace used to run experiments.

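A sketch of creating the Experiment, assuming the SDK v1 Experiment class. The experiment name is a hypothetical placeholder. The naming rule checked here (3 to 36 characters, starting with a letter or number, containing only letters, numbers, underscores, and dashes) is stated from memory of the SDK's validation message, so treat it as an assumption and verify it against your SDK version.

```python
import re

# Assumed naming rule; verify against the SDK version you have installed.
_EXPERIMENT_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{2,35}$")


def is_valid_experiment_name(name):
    """Return True if the name satisfies the assumed naming rule."""
    return bool(_EXPERIMENT_NAME.match(name))


def get_experiment(ws, experiment_name):
    """Create (or fetch, if it exists) a named Experiment in the workspace."""
    if not is_valid_experiment_name(experiment_name):
        raise ValueError("Invalid experiment name: {!r}".format(experiment_name))
    from azureml.core.experiment import Experiment  # requires azureml-core
    return Experiment(ws, experiment_name)


experiment_name = "automl-local-classification"  # hypothetical name
print(experiment_name, "valid:", is_valid_experiment_name(experiment_name))
# experiment = get_experiment(ws, experiment_name)
```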

Load Training Data Using Dataset

Automated ML takes a TabularDataset as input.

You are free to use the data preparation libraries and tools of your choice to do the required preparation. Once you are done, you can write the result to a datastore and create a TabularDataset from it.

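The flow described above (prepare, write to a datastore, create a TabularDataset) could be sketched as follows. The CSV step uses only the standard library; the upload step assumes the SDK v1 calls get_default_datastore, upload, and Dataset.Tabular.from_delimited_files. All function names, paths, and sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_prepared_csv(rows, header, path):
    """Write prepared rows (features plus label) to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def upload_and_create_dataset(ws, local_dir, target_path="automl-data"):
    """Upload the CSV files in local_dir and wrap them in a TabularDataset."""
    from azureml.core.dataset import Dataset  # requires azureml-core
    datastore = ws.get_default_datastore()
    datastore.upload(src_dir=local_dir, target_path=target_path, overwrite=True)
    return Dataset.Tabular.from_delimited_files(path=[(datastore, target_path + "/*.csv")])


# Tiny illustrative data set written to a temporary directory.
local_dir = tempfile.mkdtemp()
csv_path = write_prepared_csv(
    rows=[[0.1, 0.2, 1], [0.3, 0.4, 0]],
    header=["feature_1", "feature_2", "label"],
    path=os.path.join(local_dir, "train.csv"),
)
# dataset = upload_and_create_dataset(ws, local_dir)
```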

Review the TabularDataset

You can peek at the result of a TabularDataset at any range using skip(i) and take(j).to_pandas_dataframe(). Doing so evaluates only j records for all the steps in the TabularDataset, which makes it fast even against large datasets.

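As a small illustration of the skip and take pattern described above, the hypothetical helper below chains the three calls. It accepts any object that offers skip, take, and to_pandas_dataframe, which a TabularDataset does.

```python
def peek(dataset, skip_rows=0, take_rows=5):
    """Return take_rows records of the dataset, starting after skip_rows records."""
    return dataset.skip(skip_rows).take(take_rows).to_pandas_dataframe()


# Example (requires a TabularDataset named dataset):
# peek(dataset, skip_rows=10, take_rows=5)
```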

Configure AutoML

Instantiate an AutoMLConfig object to specify the settings and data used to run the experiment.

Property | Description
--- | ---
task | classification or regression
primary_metric | The metric that you want to optimize. Classification supports the following primary metrics: accuracy, AUC_weighted, average_precision_score_weighted, norm_macro_recall, precision_score_weighted
primary_metric | The metric that you want to optimize. Regression supports the following primary metrics: spearman_correlation, normalized_root_mean_squared_error, r2_score, normalized_mean_absolute_error
iteration_timeout_minutes | Time limit in minutes for each iteration.
iterations | Number of iterations. In each iteration AutoML trains a specific pipeline with the data.
spark_context | Spark Context object. For Databricks, use spark_context=sc.
max_concurrent_iterations | Maximum number of iterations to execute in parallel. This should be <= number of worker nodes in your Azure Databricks cluster.
n_cross_validations | Number of cross validation splits.
training_data | Input dataset, containing both features and label column.
label_column_name | The name of the label column.
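A sketch of a configuration that uses the properties in the table. The numeric values are illustrative assumptions, not recommendations, and the helper names are hypothetical. The concurrency is capped at the number of worker nodes, as the table advises; sc is the Spark context that Databricks notebooks provide.

```python
def automl_settings(training_data, label_column_name, spark_context,
                    worker_nodes, requested_concurrency=2):
    """Build keyword arguments for AutoMLConfig from the properties in the table."""
    return {
        "task": "classification",
        "primary_metric": "AUC_weighted",
        "iteration_timeout_minutes": 10,  # illustrative value
        "iterations": 5,                  # illustrative value
        "n_cross_validations": 5,         # illustrative value
        # Keep parallelism at or below the number of worker nodes.
        "max_concurrent_iterations": max(1, min(requested_concurrency, worker_nodes)),
        "spark_context": spark_context,
        "training_data": training_data,
        "label_column_name": label_column_name,
    }


def build_automl_config(settings):
    """Instantiate AutoMLConfig; requires the AutoML packages on the cluster."""
    from azureml.train.automl import AutoMLConfig
    return AutoMLConfig(**settings)


# In a Databricks notebook (dataset and sc must already exist):
# automl_config = build_automl_config(
#     automl_settings(dataset, "label", spark_context=sc, worker_nodes=2))
```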

Train the Models

Call the submit method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.

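Because the local run is synchronous, the submit call returns only when training has finished. The hypothetical helper below wraps the call and reports how long it took; show_output=True is assumed to stream iteration progress into the notebook.

```python
import time


def run_automl(experiment, automl_config):
    """Submit the AutoML configuration and return the run with the elapsed seconds."""
    started = time.time()
    run = experiment.submit(automl_config, show_output=True)  # blocks until done
    return run, time.time() - started


# local_run, seconds = run_automl(experiment, automl_config)
# print("Training took {:.0f} seconds".format(seconds))
```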

Explore the Results

Portal URL for Monitoring Runs

The following will provide a link to the web interface to explore individual run details and status. In the future we might support output displayed in the notebook.

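A sketch of producing that link, assuming the run object exposes get_portal_url and an id attribute as SDK v1 runs do. The helper name is hypothetical; displayHTML is the function Databricks notebooks provide for rendering HTML.

```python
import html


def portal_link_html(run):
    """Build an HTML anchor that opens the run's details page in the Azure portal."""
    url = html.escape(run.get_portal_url(), quote=True)
    label = html.escape(str(run.id))
    return '<a href="{0}" target="_blank">Azure portal: {1}</a>'.format(url, label)


# In a Databricks notebook:
# displayHTML(portal_link_html(local_run))
```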

Deploy

Retrieve the Best Model

Below we select the best pipeline from our iterations. The get_output method on automl_classifier returns the best run and the fitted model for the last invocation. Overloads on get_output allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration.

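The default call and the two overloads mentioned above can be sketched with one hypothetical helper. It assumes get_output accepts optional metric and iteration keyword arguments and returns a (best run, fitted model) pair.

```python
def best_model(run, metric=None, iteration=None):
    """Return (best_run, fitted_model), optionally for a given metric or iteration."""
    kwargs = {}
    if metric is not None:
        kwargs["metric"] = metric        # best run according to this logged metric
    if iteration is not None:
        kwargs["iteration"] = iteration  # run and model from this iteration
    best_run, fitted_model = run.get_output(**kwargs)
    return best_run, fitted_model


# best_run, fitted_model = best_model(local_run)
# best_run, fitted_model = best_model(local_run, metric="accuracy")
# third_run, third_model = best_model(local_run, iteration=3)
```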

Test the Best Fitted Model

Load the test data. You can split the dataset beforehand, pass the training split to AutoML, and use the test split to evaluate the best model.

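One possible way to make that split, assuming the TabularDataset method random_split, which returns two datasets. The helper name, the 80/20 fraction, and the seed are illustrative assumptions.

```python
def split_dataset(dataset, train_fraction=0.8, seed=42):
    """Split a TabularDataset into (train, test) parts by approximate fraction."""
    train_data, test_data = dataset.random_split(percentage=train_fraction, seed=seed)
    return train_data, test_data


# train_data, test_data = split_dataset(dataset)
# Pass train_data as training_data to AutoMLConfig and keep test_data for evaluation.
```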

Testing Our Best Fitted Model

We will predict digits on the test data to see how well the model works. This is only an illustrative example.

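A sketch of such a test, assuming the fitted model offers a scikit-learn style predict method. The helper name is hypothetical, and accuracy is used only as a simple illustrative measure.

```python
def accuracy(fitted_model, features, labels):
    """Return the fraction of test rows whose predicted label matches the true label."""
    predictions = fitted_model.predict(features)
    correct = sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)
    return correct / len(labels)


# With pandas objects taken from the test split (the column name "label" is hypothetical):
# test_df = test_data.to_pandas_dataframe()
# print(accuracy(fitted_model, test_df.drop(columns=["label"]), test_df["label"].tolist()))
```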
