Tabular Dataset Inference Iris
Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.
Using Azure Machine Learning Pipelines for Batch Inference for CSV Files
In this notebook, we will demonstrate how to make predictions on large quantities of data asynchronously using ML pipelines with Azure Machine Learning. Batch inference (or batch scoring) provides cost-effective inference with high throughput for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data. Batch prediction is optimized for high-throughput, fire-and-forget predictions over a large collection of data.
Tip If your system requires low-latency processing (to process a single document or small set of documents quickly), use real-time scoring instead of batch prediction.
In this example we will use a machine learning model already trained to predict different varieties of iris flowers, and run that trained model on some of the data in a CSV file containing characteristics of different iris flowers. The same example can be extended to any embarrassingly parallel data processing through a Python script.
The outline of this notebook is as follows:
- Create a DataStore referencing the CSV files stored in a blob container.
- Register the pretrained model into the model registry.
- Use the registered model to do batch inference on the CSV files in the data blob container.
Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that has information on your workspace, subscription id, etc.
Connect to workspace
Create a workspace object from the existing workspace. Workspace.from_config() reads the file config.json and loads the details into an object named ws.
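A minimal sketch of this step, assuming a valid config.json is present in the working directory or a parent directory:

```python
from azureml.core import Workspace

# load workspace details from config.json
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
```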
Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train and score machine learning models on clusters of Azure virtual machines, including VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as the environment for running batch inference. The code below creates the compute cluster for you if it doesn't already exist in your workspace.
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
Creation of compute takes approximately 5 minutes. If the AmlCompute with that name is already in your workspace the code will skip the creation process.
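A sketch along these lines (the cluster name and VM size are assumptions; adjust them to your workspace):

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

compute_name = "cpu-cluster"  # assumed name; reuse your own cluster if you have one

try:
    # reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
    print("Found existing compute target.")
except ComputeTargetException:
    # otherwise provision a small CPU cluster that scales down to zero nodes
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2", min_nodes=0, max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)
```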
Create a datastore containing the sample CSV files
The input dataset used for this notebook is CSV data with attributes of different iris flowers. We have created a public blob container sampledata on an account named pipelinedata, containing the iris data set. In the next step, we create a datastore with the name iris_datastore, which points to this container. In the call to register_azure_blob_container below, setting the overwrite flag to True overwrites any datastore that was previously created with that name.
This step can be changed to point to your blob container by providing your own datastore_name, container_name, and account_name.
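A sketch of the registration call, using the sample container described above:

```python
from azureml.core import Datastore

# the sample container is public, so no credentials are needed
iris_data = Datastore.register_azure_blob_container(
    ws,
    datastore_name="iris_datastore",
    container_name="sampledata",
    account_name="pipelinedata",
    overwrite=True,
)
```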
Create a TabularDataset
A TabularDataset references one or more files containing data in a tabular structure (e.g., CSV files) in your datastores or at public URLs. TabularDatasets provide the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location; any subsetting transformations applied to the dataset are stored in it as well. The data remains in its existing location, so no extra storage cost is incurred. You can use dataset objects as inputs. Register the datasets to the workspace if you want to reuse them later.
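A sketch of creating the dataset from the datastore registered above (the iris/ folder path reflects the sample container layout and is an assumption here):

```python
from azureml.core.dataset import Dataset

# point at the CSV files under the iris/ folder of the datastore
path_on_datastore = iris_data.path("iris/")
input_iris_ds = Dataset.Tabular.from_delimited_files(path=path_on_datastore, validate=False)
```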
Intermediate/Output Data
Intermediate data (or the output of a step) is represented by a PipelineData object. PipelineData can be produced by one step and consumed by another by providing the PipelineData object as the output of one step and the input of one or more subsequent steps.
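A sketch of the output object used later by the batch inference step (the name inferences matches the output discussed at the end of this notebook):

```python
from azureml.pipeline.core import PipelineData

# the scoring step writes its results to this location on the default datastore
output_folder = PipelineData(name="inferences", datastore=ws.get_default_datastore())
```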
Registering the Model with the Workspace
Get the pretrained model from a publicly available Azure Blob container, then register it for use in your workspace.
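A sketch of the registration, assuming the pretrained model has already been downloaded locally (the file and model names are assumptions):

```python
from azureml.core.model import Model

# register the downloaded model file so worker nodes can retrieve it by name
model = Model.register(
    workspace=ws,
    model_path="iris_model.pkl",  # assumed local path to the downloaded model
    model_name="iris-prs",        # assumed registry name, reused in the entry script
)
```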
Using your model to make batch predictions
To use the model to make batch predictions, you need an entry script and a list of dependencies:
An entry script
This script accepts requests, scores the requests by using the model, and returns the results.
- init() - Typically this function loads the model into a global object. It is run only once, at the start of batch processing, per worker node/process. The init method can make use of the following environment variables (ParallelRunStep input):
- AZUREML_BI_OUTPUT_PATH - output folder path
- run(mini_batch) - The method to be parallelized. Each invocation will process one mini-batch.
mini_batch: Batch inference will invoke the run method and pass either a list or a Pandas DataFrame as the argument. Each entry in mini_batch will be a file path if the input is a FileDataset, or a Pandas DataFrame if the input is a TabularDataset.
run method response: the run() method should return a Pandas DataFrame or an array. For the append_row output_action, these returned elements are appended to the common output file. For summary_only, the contents of the elements are ignored. For all output actions, each returned element indicates one successful inference of an element in the input mini-batch. Make sure the inference result includes enough data to map each output back to its input: the inference output is written to the output file in no guaranteed order, so include a key in the output that maps each row to its input (see the sketch after this list).
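A minimal sketch of such an entry script, assuming the registered model name from the step above and that every column of the mini-batch is a model feature (the notebook's own iris_score.py is printed further below):

```python
import pickle

from azureml.core.model import Model


def init():
    global iris_model
    # runs once per worker process: locate and load the registered model
    model_path = Model.get_model_path("iris-prs")  # assumed model name
    with open(model_path, "rb") as model_file:
        iris_model = pickle.load(model_file)


def run(mini_batch):
    # mini_batch is a pandas DataFrame because the input is a TabularDataset
    predictions = iris_model.predict(mini_batch)  # assumes all columns are features
    result = mini_batch.copy()
    result["PredictedLabel"] = predictions
    # with append_row, each row of this DataFrame is appended to the output file
    return result
```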
Dependencies
Helper scripts or Python/Conda packages required to run the entry script.
Print inferencing script
Build and run the batch inference pipeline
The data, models, and compute resource are now available. Let's put all these together in a pipeline.
Specify the environment to run the script
Specify the conda dependencies for your script. This will allow us to install pip packages as well as configure the inference environment.
- Always include azureml-core and azureml-dataset-runtime[fuse] in the pip package list to make ParallelRunStep run properly.
- For TabularDataset, add pandas, as run(mini_batch) uses pandas.DataFrame as the mini_batch type.
If you're using a custom image (batch_env.python.user_managed_dependencies = True), you need to install these packages into your image.
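A sketch of such an environment (the extra scikit-learn dependency is an assumption based on the pickled model):

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

batch_conda_deps = CondaDependencies.create(
    pip_packages=[
        "azureml-core",
        "azureml-dataset-runtime[fuse]",  # required by ParallelRunStep
        "pandas",                         # required for TabularDataset mini-batches
        "scikit-learn",                   # assumed dependency of the pretrained model
    ]
)
batch_env = Environment(name="batch_environment")
batch_env.python.conda_dependencies = batch_conda_deps
```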
Create the configuration to wrap the inference script
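A sketch of the ParallelRunConfig (the script location, mini-batch size, and node count are assumptions to adjust for your workload):

```python
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="Scripts",    # assumed folder containing the entry script
    entry_script="iris_score.py",  # the inferencing script shown above
    mini_batch_size="1MB",         # for TabularDataset, a size string such as "1MB"
    error_threshold=10,            # fail the run if more than 10 records error out
    output_action="append_row",    # append each returned row to one output file
    environment=batch_env,
    compute_target=compute_target,
    node_count=2,
)
```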
Create the pipeline step
Create the pipeline step using the script, environment configuration, and parameters. Specify the compute target you already attached to your workspace as the target of execution of the script. We will use ParallelRunStep to create the pipeline step.
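A sketch of the step definition, wiring together the dataset, configuration, and output from above (the step and input names are assumptions):

```python
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="batch-inference-iris",
    parallel_run_config=parallel_run_config,
    inputs=[input_iris_ds.as_named_input("iris_data")],
    output=output_folder,
    allow_reuse=False,
)
```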
Run the pipeline
At this point you can run the pipeline and examine the output it produces. The Experiment object is used to track the pipeline run.
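A sketch of submitting the pipeline (the experiment name is an assumption):

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
experiment = Experiment(ws, "iris-batch-inference")  # assumed experiment name
pipeline_run = experiment.submit(pipeline)
```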
View progress of Pipeline run
The pipeline run status can be checked in the Azure Machine Learning portal (https://ml.azure.com). The link to the pipeline run can be retrieved by inspecting the pipeline_run object.
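For example:

```python
# prints a direct link to the run in the Azure Machine Learning portal
print(pipeline_run.get_portal_url())
```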
Optional: View detailed logs (streaming)
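A sketch of streaming the logs until the run finishes:

```python
# blocks and streams log output until the pipeline run completes
pipeline_run.wait_for_completion(show_output=True)
```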
View Results
In the iris_score.py file above you can see that the result with the prediction of the iris variety is returned and appended to the original input row from the CSV file. These results are written to the datastore specified in the PipelineData object as the output data, which in this case is called inferences. The output contains the results from all of the worker nodes used in the compute cluster. You can download this data to view the results; the code below simply filters to a random 20 rows.
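A sketch of downloading and sampling the results (parallel_run_step.txt is the default append_row output file; the download path is an assumption):

```python
import os

import pandas as pd

# fetch the "inferences" output produced by the single step of this pipeline
step_run = next(pipeline_run.get_children())
batch_output = step_run.get_output_data("inferences")
batch_output.download(local_path="iris_results")

# locate the default append_row output file inside the downloaded folder
result_file = None
for root, _, files in os.walk("iris_results"):
    for name in files:
        if name == "parallel_run_step.txt":
            result_file = os.path.join(root, name)

# append_row writes space-delimited rows with no header
df = pd.read_csv(result_file, delimiter=" ", header=None)
print(df.sample(n=20))
```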
Cleanup compute resources
For recurring jobs, it may be wise to keep the compute resources and allow the compute nodes to scale down to zero. However, since this is a single-run job, we are free to release the allocated compute resources.
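A sketch of releasing the cluster:

```python
# delete the compute target since this was a one-off job
compute_target.delete()
```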