File Dataset Partition Per Folder
Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.
Using Azure Machine Learning Pipelines for Batch Inference for files input partitioned by folder structure
In this notebook, we will demonstrate how to make predictions on large quantities of data asynchronously using ML pipelines with Azure Machine Learning. Batch inference (or batch scoring) provides cost-effective inference with unparalleled throughput for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data, and are optimized for high-throughput, fire-and-forget predictions over large collections of data.
Tip If your system requires low-latency processing (to process a single document or small set of documents quickly), use real-time scoring instead of batch prediction.
This example will create a sample dataset with a nested folder structure, where each folder name corresponds to an attribute of the files inside it. The batch inference job splits the files in the dataset according to their attributes, so that all files with the same value for the specified attribute form a single mini-batch to be processed.
The outline of this notebook is as follows:
- Create a dataset with a nested folder structure and a partition_format that interprets the folder structure into the attributes of the files inside.
- Do batch inference on each mini-batch defined by the folder structure.
Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that has information on your workspace, subscription id, etc.
Connect to workspace
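A minimal sketch of this step, assuming the config file written by the configuration notebook referenced in the prerequisites is available locally:

```python
# Load the workspace from the saved config file.
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")
```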
Upload local test data to datastore
The destination folder in the datastore is structured so that the name of each folder layer corresponds to a property of all the files inside that folder.
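A sketch of the upload, assuming the workspace's default datastore is used; the local folder name "sample_data" is a placeholder, while the target path matches the "dataset_partition_test" prefix used later in this notebook:

```python
# Upload a local folder of sample files to the workspace's default datastore,
# preserving the nested folder structure.
from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()
datastore.upload(
    src_dir="./sample_data",                  # local sample files (placeholder name)
    target_path="dataset_partition_test",     # nested folders encode file attributes
    overwrite=True,
)
```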
Create partitioned file dataset
Create a file dataset partitioned by 'user', 'season', and 'genres', each corresponding to a layer specified in partition_format. You can get a partition of data by specifying the value of one or more partition keys. E.g., by specifying user=user1 and genres=piano, you can get all the files that match dataset_partition_test/user1/*/piano.wav.
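A sketch of creating the partitioned dataset; the exact partition_format string is an assumption inferred from the example path above, where the first two layers are folders and 'genres' comes from the file name:

```python
# Create a file dataset whose partition keys ('user', 'season', 'genres') are
# parsed from the folder layers and file name via partition_format.
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

partitioned_dataset = Dataset.File.from_files(
    path=(datastore, "dataset_partition_test/**"),
    partition_format="dataset_partition_test/{user}/{season}/{genres}.wav",
)
print(partitioned_dataset.partition_keys)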
Create or Attach existing compute resource
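A common pattern for this step is shown below; the cluster name and VM size are assumptions to adjust for your workspace:

```python
# Attach an existing AmlCompute cluster, or create one if it does not exist.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()
cluster_name = "cpu-cluster"  # assumed name

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2", max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```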
Intermediate/Output Data
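The output location can be declared as a PipelineData object; the name "inferences" matches the output referred to when viewing the prediction results below:

```python
# Declare where the batch inference results will be written on the datastore.
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData

ws = Workspace.from_config()
output_dir = PipelineData(name="inferences", datastore=ws.get_default_datastore())
```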
Calculate total file size of each mini-batch partitioned by dataset partition key(s)
The script sums up the total size of the files in each mini-batch.
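A sketch of the entry script, assuming it is saved as total_file_size.py; ParallelRunStep calls init() once per worker process and run() once per mini-batch, and for a file dataset each mini-batch is a list of local file paths:

```python
# total_file_size.py
import os


def init():
    # No per-worker setup is needed for this script.
    pass


def run(mini_batch):
    # Sum the sizes of all files in this mini-batch (a list of file paths).
    total_size = sum(os.path.getsize(p) for p in mini_batch)
    # Return one result row per mini-batch; with output_action="append_row",
    # rows from all mini-batches are appended into a single output file.
    return [f"{len(mini_batch)} files, {total_size} bytes in total"]
```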
Build and run the batch inference pipeline
Specify the environment to run the script
You need to specify the required azureml packages (including any private package builds) in the environment's dependencies.
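A sketch of the environment definition; the environment name is an assumption, and ParallelRunStep requires the azureml-core and azureml-dataset-runtime[fuse] pip packages:

```python
# Define the environment that will run the entry script on the cluster.
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

batch_env = Environment(name="file-partition-batch-env")
batch_env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-core", "azureml-dataset-runtime[fuse]"]
)
```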
Create the configuration to wrap the inference script
The parameter partition_keys is a list containing a subset of the dataset's partition keys, specifying how the input dataset is partitioned. Every distinct combination of values of partition_keys forms one mini-batch. E.g., specifying partition_keys=['user', 'genres'] results in 5 mini-batches: user=halit && genres=disco, user=halit && genres=orchestra, user=chunyu && genres=piano, user=kin && genres=spirituality, and user=ramandeep && genres=piano.
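A sketch of the configuration, where partition_keys=['user', 'genres'] reproduces the 5 mini-batches described above; the source_directory name and the compute_target / batch_env variable names are assumptions carried over from earlier steps:

```python
# Wrap the entry script and its runtime settings in a ParallelRunConfig.
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",          # assumed folder holding total_file_size.py
    entry_script="total_file_size.py",
    partition_keys=["user", "genres"],   # do not set mini_batch_size with partition_keys
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2,
)
```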
Create the pipeline step
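A sketch of the step definition; the variable names (partitioned_dataset, parallel_run_config, output_dir) follow the earlier sections, and the step and input names are assumptions:

```python
# Create the batch inference pipeline step from the configuration above.
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="total-file-size",
    parallel_run_config=parallel_run_config,
    inputs=[partitioned_dataset.as_named_input("partitioned_files")],
    output=output_dir,
    allow_reuse=False,
)
```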
Run the pipeline
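A sketch of submitting the pipeline; the experiment name is an assumption:

```python
# Build the pipeline from the single step and submit it as an experiment run.
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()
pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
pipeline_run = Experiment(ws, "file-partition-batch-inference").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```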
View the prediction results
In the total_file_size.py file above, you can see that a ResultList with the filename and the prediction result is returned. These results are written to the datastore specified in the PipelineData object as the output data, which in this case is called inferences. This contains the outputs from all of the worker nodes used in the compute cluster. You can download this data to view the results; the code below filters to just the first 10 rows.
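A sketch of downloading and previewing the results; the file name parallel_run_step.txt and the space-delimited layout are assumptions based on the default append_row output action:

```python
# Download the aggregated output of the step and preview the first 10 rows.
import os
import tempfile

import pandas as pd

batch_run = pipeline_run.find_step_run(parallel_run_step.name)[0]
batch_output = batch_run.get_output_data(output_dir.name)

target_dir = tempfile.mkdtemp()
batch_output.download(local_path=target_dir)
result_file = os.path.join(
    target_dir, batch_output.path_on_datastore, "parallel_run_step.txt"
)

df = pd.read_csv(result_file, delimiter=" ", header=None)
print(df.head(10))
```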