Ensemble model inference with NVIDIA Triton Inference Server and NVIDIA DALI on Amazon SageMaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Deep learning applications are often complex, requiring multi-stage data loading and pre-processing pipelines. Optimizing these pre-processing steps is critical to achieving the best-performing inference workloads. In a computer vision application, pre-processing pipelines may include steps like image loading, decoding, cropping, resizing, and other image augmentations. These data processing pipelines can become a bottleneck, limiting the performance and scalability of deep learning inference. Additionally, such pre-processing implementations can create challenges for the portability of inference workloads and for code maintainability.
In this notebook, we take a deep dive into implementing an NVIDIA DALI pre-processing pipeline for the Inception V3 model. The pipeline implements image pre-processing steps such as decoding, resizing, and cropping. We then serialize the pipeline and create a model configuration to be deployed with NVIDIA Triton Inference Server. Finally, we deploy the Inception V3 model to an Amazon SageMaker real-time endpoint using the Triton Inference Server Deep Learning Containers.
NVIDIA DALI
The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video, and audio data. It can be used as a portable drop-in replacement for the built-in data loaders and data iterators in popular deep learning frameworks.
DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline. Features such as prefetching, parallel execution, and batch processing are handled transparently for the user. Data processing pipelines implemented using DALI are portable because they can easily be retargeted to TensorFlow, PyTorch, MXNet and PaddlePaddle.
Highlights
- Easy integration with NVIDIA Triton Inference Server.
- Support for multiple data formats: RecordIO, TFRecord, COCO, JPEG, etc.
- Portable across popular deep learning frameworks: TensorFlow, PyTorch, MXNet.
- Supports CPU and GPU execution.
- Scalable across multiple GPUs.
- Flexible graphs let developers create custom pipelines.
Triton Model Ensembles
Triton Inference Server greatly simplifies the deployment of AI models at scale in production. It also offers a convenient solution for building pre-processing and post-processing pipelines: the ensemble scheduler, which is responsible for pipelining the models participating in the inference process while ensuring efficiency and optimizing throughput.

Set up
Install the dependencies required to package the model and run inferences using SageMaker Triton server.
Execute the following command to install the latest DALI release for the specified CUDA version.
Note: We are installing NVIDIA DALI for CUDA in the step below. You need to execute this notebook on a GPU-based instance.
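The install command below is a minimal sketch, assuming a CUDA 11.x environment; adjust the package suffix (for example, nvidia-dali-cuda120) to match the CUDA version on your instance.

```python
# Install DALI from NVIDIA's package index (assumes CUDA 11.x; adjust the suffix to your CUDA version).
!pip install --upgrade --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
```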
Imports
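A representative set of imports for the rest of this notebook might look like the following sketch; the exact imports used in the original notebook may differ slightly.

```python
import json
import os
import tarfile
import time

import boto3
import numpy as np
import sagemaker

import nvidia.dali as dali
import nvidia.dali.types as types
```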
Variables
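A minimal sketch of the session objects and S3 locations used in the rest of the notebook; the prefix name is illustrative.

```python
# SageMaker session, execution role, and clients used throughout the notebook.
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm_client = boto3.client("sagemaker")
runtime_client = boto3.client("sagemaker-runtime")

prefix = "dali-inception-ensemble"  # illustrative S3 key prefix for the model artifacts
```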
Download models and set up pre-processing pipeline with DALI
Create directories to host the DALI ensemble models in the model repository. The following example shows the model repository directory structure, containing the DALI pre-processing model, the TensorFlow Inception V3 model, and the model ensemble.
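The sketch below creates the directories and documents the assumed layout; the model names (dali, inception_graphdef, ensemble_dali_inception) follow the Triton DALI ensemble example and must match the names used in the model configurations.

```python
# Assumed model repository layout:
#
# model_repository/
# ├── dali/                      # DALI pre-processing model (DALI backend)
# │   ├── config.pbtxt
# │   └── 1/model.dali
# ├── inception_graphdef/        # TensorFlow Inception V3 model
# │   ├── config.pbtxt
# │   └── 1/model.graphdef
# └── ensemble_dali_inception/   # ensemble tying the two models together
#     ├── config.pbtxt
#     └── 1/
import os

for model_dir in ["dali", "inception_graphdef", "ensemble_dali_inception"]:
    os.makedirs(os.path.join("model_repository", model_dir, "1"), exist_ok=True)
```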

Next, we will download the Inception V3 model, an image classification neural network.
Place the downloaded Inception V3 model in the model repository under the inception_graphdef folder.
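A sketch of the download step, assuming the frozen Inception V3 GraphDef published by TensorFlow; the URL and archive contents are assumptions based on the public Triton examples and should be verified before use.

```python
import os
import shutil
import tarfile
import urllib.request

# Frozen Inception V3 GraphDef (URL assumed from the public TensorFlow model archive).
url = (
    "https://storage.googleapis.com/download.tensorflow.org/models/"
    "inception_v3_2016_08_28_frozen.pb.tar.gz"
)
urllib.request.urlretrieve(url, "/tmp/inception_v3.tar.gz")

with tarfile.open("/tmp/inception_v3.tar.gz") as tar:
    tar.extractall("/tmp/inception_v3")

# Place the frozen graph in the model repository as version 1 of inception_graphdef.
shutil.copy(
    "/tmp/inception_v3/inception_v3_2016_08_28_frozen.pb",
    "model_repository/inception_graphdef/1/model.graphdef",
)
```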
The model configuration of the ensemble model for image classification and DALI pre-processing is shown below.
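The following config.pbtxt is a sketch of what the ensemble configuration could look like, following the Triton DALI ensemble example; the tensor names, dimensions, and max_batch_size are assumptions and must match your actual models.

```
name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  { name: "INPUT", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "OUTPUT", data_type: TYPE_FP32, dims: [ 1001 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map  { key: "DALI_INPUT_0", value: "INPUT" }
      output_map { key: "DALI_OUTPUT_0", value: "preprocessed_image" }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map  { key: "input", value: "preprocessed_image" }
      output_map { key: "InceptionV3/Predictions/Softmax", value: "OUTPUT" }
    }
  ]
}
```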
The model configuration for the DALI backend is shown below.
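A sketch of the DALI model's config.pbtxt, again with assumed tensor names and shapes that must match the serialized pipeline.

```
name: "dali"
backend: "dali"
max_batch_size: 256
input [
  { name: "DALI_INPUT_0", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "DALI_OUTPUT_0", data_type: TYPE_FP32, dims: [ 299, 299, 3 ] }
]
```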
The model configuration containing the Inception model graph definition is shown below.
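A sketch of the Inception V3 GraphDef configuration; the input/output tensor names and the labels filename are assumptions taken from the frozen graph used in the Triton examples.

```
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
```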
We will copy the Inception classification model labels to the inception_graphdef directory in the model repository. The labels file contains 1000 class labels of the ImageNet classification dataset.
DALI Pipeline
In DALI, any data processing task has a central object called a pipeline. A pipeline object is an instance of nvidia.dali.Pipeline, which encapsulates the data processing graph and the execution engine. You can define a DALI pipeline by implementing a function that uses DALI operators inside it and decorating it with the pipeline_def() decorator.
DALI pipelines are executed in stages. The stages correspond to the device parameter that can be specified for each operator, and are executed in the following order:
- 'cpu' - operators that accept CPU inputs and produce CPU outputs.
- 'mixed' - operators that accept CPU inputs and produce GPU outputs, for example nvidia.dali.fn.decoders.image().
- 'gpu' - operators that accept GPU inputs and produce GPU outputs.
Parameters
- batch_size - Maximum batch size of the pipeline
- num_threads - Number of CPU threads used by the pipeline
- device_id - ID of the GPU used by the pipeline
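A minimal sketch of the pre-processing pipeline, modeled on the DALI backend Inception example; the resize dimensions, normalization constants, and batch/thread settings are assumptions that should be aligned with the Inception V3 model's expected input.

```python
import nvidia.dali as dali
import nvidia.dali.types as types


@dali.pipeline_def(batch_size=256, num_threads=4, device_id=0)
def preprocessing_pipeline():
    # Encoded image bytes arrive from Triton as a 1-D uint8 tensor per sample.
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT_0")
    # 'mixed' operator: CPU input (encoded bytes), GPU output (decoded image).
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    # 'gpu' operators: resize to the network's input resolution and normalize.
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="HWC",
        crop=(299, 299),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images
```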
Serialize the pipeline to a protobuf string; filename is the file where the serialized pipeline will be written.
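A sketch of the serialization call, writing the pipeline defined above into the DALI model's version directory.

```python
# Serialize the pipeline graph and store it as version 1 of the "dali" model.
preprocessing_pipeline().serialize(filename="model_repository/dali/1/model.dali")
```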
Get Triton Inference Server Container image
Now that we have set up the DALI pipeline, we will get the SageMaker Triton image from Amazon ECR and use it to deploy the Inception V3 model to an Amazon SageMaker real-time endpoint.
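A sketch of how the container image URI can be assembled; the Triton DLC account ID differs by region, so the value below is a placeholder to be looked up in the SageMaker documentation.

```python
# The SageMaker Triton container lives in region-specific ECR accounts.
# Replace the placeholder with the account ID for your region.
triton_account_id = "<triton-dlc-account-id-for-your-region>"
triton_image_uri = (
    f"{triton_account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-tritonserver:21.08-py3"
)
```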
Let's create the model artifact.
Once the contents of the model repository directory are tar'd into a model.tar.gz file, we will upload the model artifacts to the model_uri S3 location.
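A sketch of the packaging and upload step, assuming the repository contents sit at the top level of the archive and reusing the session and prefix defined earlier.

```python
# Package the model repository contents into model.tar.gz.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in os.listdir("model_repository"):
        tar.add(os.path.join("model_repository", name), arcname=name)

# Upload the artifact to S3; model_uri is used when creating the SageMaker model.
model_uri = sess.upload_data(path="model.tar.gz", key_prefix=prefix)
print(model_uri)
```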
Create SageMaker Endpoint
We start off by creating a SageMaker model from the model artifacts we uploaded to S3 in the previous step.
In this step we also provide an additional environment variable, SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to S3. This variable is optional in the case of a single model. In the case of ensemble models, this key has to be specified for Triton to start up in SageMaker.
Additionally, customers can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT to optimize the thread counts.
Note: The current release of Triton (21.08-py3) on SageMaker doesn't support running instances of different models on the same server, except in the case of ensembles. Only multiple model instances of the same model are supported, which can be specified under the instance-groups section of the config.pbtxt file.
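A sketch of the create_model call using the boto3 SageMaker client; the model name is illustrative, and SAGEMAKER_TRITON_DEFAULT_MODEL_NAME points at the ensemble defined in the model repository.

```python
sm_model_name = "triton-dali-inception-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    # Required for ensembles: the default model Triton should serve.
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_dali_inception"},
}

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print("Model Arn:", create_model_response["ModelArn"])
```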
Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.
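A sketch of the endpoint configuration; the instance type shown is only an example of a GPU instance and should be chosen to fit your workload.

```python
endpoint_config_name = "triton-dali-inception-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": sm_model_name,
            "InstanceType": "ml.g4dn.4xlarge",  # example GPU instance type
            "InitialInstanceCount": 1,
        }
    ],
)
print("Endpoint Config Arn:", create_endpoint_config_response["EndpointConfigArn"])
```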
Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.
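A sketch of the endpoint creation and wait loop, using the boto3 endpoint_in_service waiter; the endpoint name is illustrative.

```python
endpoint_name = "triton-dali-inception-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Block until the endpoint reaches InService (or the deployment fails).
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])
```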
Prepare inference payload
Let's download an image from a SageMaker S3 bucket to be used for Inception V3 model inference. This image will go through the DALI pre-processing pipeline and be fed to the ensemble scheduler provided by Triton Inference Server.
Prepare the input payload with the name, shape, datatype, and the data as a list. This payload will be used to invoke the endpoint and get the prediction results.
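A sketch of the payload preparation; the local filename is hypothetical, and the input name must match the ensemble configuration. The raw encoded bytes are sent because decoding happens inside the DALI pipeline.

```python
sample_img_fname = "sample_image.jpg"  # hypothetical local filename of the downloaded image

# Read the raw, encoded image bytes; the DALI pipeline performs the decoding on the server.
rv = np.expand_dims(np.fromfile(sample_img_fname, dtype=np.uint8), axis=0)

payload = {
    "inputs": [
        {
            "name": "INPUT",  # must match the ensemble's input tensor name
            "shape": list(rv.shape),
            "datatype": "UINT8",
            "data": rv.tolist(),
        }
    ]
}
```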
Run inference
Once the endpoint is running, we can use the downloaded sample image to run an inference with JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.
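A sketch of the JSON invocation and of reading out the top predicted class; the content type follows the pattern used in the SageMaker Triton examples.

```python
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read().decode("utf8"))
# The ensemble's output tensor holds class probabilities; take the arg-max as the prediction.
probabilities = np.array(result["outputs"][0]["data"])
print("Predicted class index:", int(probabilities.argmax()))
```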
We can also use binary+json as the payload format to get better performance for the inference call. The specification of this format is provided here.
Note: With the binary+json format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header application/vnd.sagemaker-triton.binary+json;json-header-size={}.
Please note, this is different from using the Inference-Header-Content-Length header on a stand-alone Triton server, since custom headers are not allowed in SageMaker.
The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.
We use invoke_endpoint to pass the payload in binary+json format to the endpoint.
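The sketch below builds the binary+json request with tritonclient and parses the response, following the pattern used in the SageMaker Triton example notebooks; it relies on a private tritonclient helper whose signature may vary across versions, so treat it as an illustration rather than a stable API.

```python
import tritonclient.http as httpclient


def get_sample_image_binary(input_name, output_name, data):
    # Build a v2-protocol request whose binary tensor data follows the JSON header;
    # header_length is needed to build the SageMaker content type.
    inputs = [httpclient.InferInput(input_name, data.shape, "UINT8")]
    inputs[0].set_data_from_numpy(data, binary_data=True)
    outputs = [httpclient.InferRequestedOutput(output_name, binary_data=True)]
    # Private helper used in the SageMaker Triton examples; its signature may differ
    # between tritonclient versions.
    request_body, header_length = httpclient._utils._get_inference_request(
        inputs=inputs,
        outputs=outputs,
        request_id="",
        sequence_id=0,
        sequence_start=0,
        sequence_end=0,
        priority=0,
        timeout=None,
    )
    return request_body, header_length


request_body, header_length = get_sample_image_binary("INPUT", "OUTPUT", rv)

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
)

# The response content type carries the size of its JSON part, which is needed to
# split the JSON metadata from the binary tensor data that follows it.
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix):]
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
print("Predicted class index:", int(result.as_numpy("OUTPUT").argmax()))
```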
Delete endpoint and model artifacts
Finally, we clean up the resources we created, i.e., the SageMaker model, endpoint configuration, and endpoint.
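A short cleanup sketch using the names created above.

```python
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)
```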
Conclusion
In this notebook, we implemented a model ensemble using NVIDIA Triton Inference Server and pre-processed images using an NVIDIA DALI pipeline. Offloading pre-processing to the GPU with DALI can significantly improve overall inference latency and throughput. Try it out!
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.