Tabular Dataset Partition Per Column

Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.

Using Azure Machine Learning Pipelines for Batch Inference for tabular input partitioned by column value

In this notebook, we will demonstrate how to make predictions on large quantities of data asynchronously using Azure Machine Learning Pipelines. Batch inference (or batch scoring) provides cost-effective inference, with unparalleled throughput for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data. Batch prediction is optimized for high-throughput, fire-and-forget predictions over a large collection of data.

Tip: If your system requires low-latency processing (to process a single document or a small set of documents quickly), use real-time scoring instead of batch prediction.

This example will create a partitioned tabular dataset by splitting the rows of a large CSV file on the values of specified columns. Each partition will form one mini-batch in the parallel processing procedure.

The outline of this notebook is as follows:

  • Create a tabular dataset partitioned by the values of specified columns.
  • Run batch inference on the dataset, where each mini-batch corresponds to one partition.

Prerequisites

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that has information on your workspace, subscription id, etc.

Connect to workspace

[ ]
[ ]
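The cell above would typically connect to the workspace. A minimal sketch, assuming a `config.json` produced by the configuration notebook is present in the working directory:

```python
# Connect to the Azure ML workspace described by config.json.
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep="\n")
```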

Download OJ sales data from the Azure Open Datasets URL

[ ]
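A sketch of the download step. The container URL and file name below are assumptions for illustration, not taken from the notebook; substitute the actual OJ sales data location.

```python
# Download one OJ sales CSV to a local folder. The URL here is a
# placeholder assumption; replace it with the real Open Datasets path.
import os
import urllib.request

oj_sales_url = (
    "https://azureopendatastorage.blob.core.windows.net/"
    "ojsales-simulatedcontainer/oj_sales_data/oj_sales_data.csv"  # assumed
)
target_dir = "oj_sales_data"
os.makedirs(target_dir, exist_ok=True)
urllib.request.urlretrieve(
    oj_sales_url, os.path.join(target_dir, "oj_sales_data.csv")
)
```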

Upload OJ sales data to datastore

[ ]
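The upload cell might look like the following sketch, assuming `ws` from the workspace cell and the local folder created in the download step:

```python
# Upload the downloaded CSVs to the workspace's default datastore.
datastore = ws.get_default_datastore()
datastore.upload(
    src_dir="oj_sales_data",       # local folder from the download step
    target_path="oj_sales_data",   # path inside the datastore
    overwrite=True,
)
```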

Create tabular dataset

Create normal tabular dataset

[ ]
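A sketch of the dataset-creation cell, assuming `datastore` from the upload step:

```python
# Create a TabularDataset over the uploaded CSV files.
from azureml.core import Dataset

dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "oj_sales_data/*.csv")
)
```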

Partition the tabular dataset

Partition the dataset by the columns 'store' and 'brand'. You can get a partition of data by specifying the value of one or more partition keys. E.g., by specifying store=1000 and brand='tropicana', you get all the rows in the dataset that match this condition.

[ ]
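A sketch of the partitioning cell. `partition_by` writes a partitioned copy of the data back to the datastore; `dataset` and `datastore` come from the earlier cells, and the target path name is illustrative:

```python
# Write a partitioned copy of the dataset, keyed on 'store' and 'brand'.
from azureml.data.datapath import DataPath

partitioned_dataset = dataset.partition_by(
    partition_keys=["store", "brand"],
    target=DataPath(datastore, "partitioned_oj_sales_data"),  # assumed path
)
```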

Create or Attach existing compute resource

[ ]
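The compute cell usually follows the standard create-or-attach pattern; the cluster name and VM size below are assumptions:

```python
# Reuse an existing AmlCompute cluster, or provision one if it is missing.
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"  # assumed name
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2", max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```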

Intermediate/Output Data

[ ]
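A sketch of the output-data cell. The name `inferences` matches the output referenced later when downloading the prediction results:

```python
# A PipelineData object backed by the default datastore collects the
# outputs written by the parallel run step.
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name="inferences", datastore=ws.get_default_datastore())
```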

Calculate total revenue of each mini-batch partitioned by dataset partition key(s)

The script sums up the total revenue of each mini-batch.

[ ]
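A minimal sketch of the `total_income.py` entry script. With a partitioned tabular input, `run` receives each partition as a pandas DataFrame; the column names `quantity` and `price` are assumptions about the CSV schema:

```python
# total_income.py -- parallel run entry script (sketch).
import pandas as pd


def init():
    # Nothing to initialize for this simple aggregation.
    pass


def run(mini_batch: pd.DataFrame):
    # mini_batch holds every row of one (store, brand) partition.
    # Column names are assumed; adjust to the real schema.
    total = (mini_batch["quantity"] * mini_batch["price"]).sum()
    return pd.DataFrame({"total_income": [float(total)]})
```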

Build and run the batch inference pipeline

Specify the environment to run the script

You need to specify the required azureml packages in the environment's dependencies.

[ ]
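A sketch of the environment cell; the environment name is illustrative and package versions are left unpinned here, which you may want to pin in practice:

```python
# Environment with pandas plus the azureml packages the entry script needs.
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

batch_env = Environment(name="oj-batch-env")  # assumed name
batch_env.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["pandas"],
    pip_packages=["azureml-core", "azureml-dataset-runtime[fuse,pandas]"],
)
```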

Create the configuration to wrap the inference script

The parameter partition_keys is a list containing a subset of the dataset's partition keys, specifying how the input dataset is partitioned. Every possible combination of values of partition_keys forms one mini-batch. E.g., specifying partition_keys=['store', 'brand'] will result in mini-batches like store=1000 && brand=tropicana, store=1000 && brand=dominicks, store=1001 && brand=dominicks, ...

[ ]
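A sketch of the ParallelRunConfig cell. The source directory, error threshold, and node count are illustrative values; `batch_env` and `compute_target` come from the earlier cells:

```python
# Wrap the entry script in a ParallelRunConfig; partition_keys makes each
# (store, brand) combination one mini-batch.
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",         # assumed folder holding total_income.py
    entry_script="total_income.py",
    partition_keys=["store", "brand"],
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2,
)
```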

Create the pipeline step

[ ]
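A sketch of the step-creation cell, wiring the partitioned dataset and the `inferences` output into a ParallelRunStep (the step and input names are illustrative):

```python
# Create the parallel run step over the partitioned dataset.
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="total-income-per-partition",  # assumed name
    parallel_run_config=parallel_run_config,
    inputs=[partitioned_dataset.as_named_input("oj_sales_partitioned")],
    output=output_dir,
    allow_reuse=False,
)
```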

Run the pipeline

[ ]
[ ]
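The two cells above would assemble, submit, and monitor the pipeline; the experiment name below is an assumption:

```python
# Build the pipeline from the single step, submit it, and wait.
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
pipeline_run = Experiment(ws, "oj-sales-partition-demo").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```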

View the prediction results

In the total_income.py file above you can see that a result list with the filename and the prediction result gets returned. These are written to the DataStore specified in the PipelineData object as the output data, which in this case is called inferences. This contains the outputs from all of the worker nodes used in the compute cluster. You can download this data to view the results; the cells below show just the first 10 rows.

[ ]
[ ]
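A sketch of the result-inspection cells. With `output_action="append_row"`, results land in a single appended file (by default `parallel_run_step.txt`); the exact on-disk layout depends on the run, so adjust the path as needed:

```python
# Download the "inferences" output of the parallel run step and preview it.
import os
import tempfile

import pandas as pd

prediction_run = next(pipeline_run.get_children())
prediction_output = prediction_run.get_output_data("inferences")

target_dir = tempfile.mkdtemp()
prediction_output.download(local_path=target_dir)

result_file = os.path.join(
    target_dir, prediction_output.path_on_datastore, "parallel_run_step.txt"
)
df = pd.read_csv(result_file, delimiter=" ", header=None)
print(df.head(10))  # first 10 rows
```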