MME Triton XGBoost FIL Ensemble
Pre-processing and XGBoost model inference pipeline with NVIDIA Triton Inference Server on Amazon SageMaker using a multi-model endpoint (MME)
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
With the NVIDIA Triton container image on SageMaker, you can now use Triton's Forest Inference Library (FIL) backend together with SageMaker multi-model endpoints to easily serve multiple tree-based ML models like XGBoost for high-performance CPU and GPU inference. The FIL backend lets you benefit from performance optimizations like dynamic batching and concurrent execution, which help maximize GPU and CPU utilization and further lower the cost of inference. The multi-framework support provided by NVIDIA Triton allows you to seamlessly deploy tree-based ML models alongside deep learning models for fast, unified inference pipelines.
Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of models. MMEs are a popular hosting choice among customers like Zendesk, Veeva, and AT&T for hosting hundreds of CPU-based models. Previously, you had limited options for deploying hundreds of deep learning models that needed accelerated compute with GPUs. With the MME support for GPU announced last year, you can deploy thousands of deep learning models behind one SageMaker endpoint. MMEs can run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can significantly reduce cost and achieve the best price performance.
Machine learning applications are complex and often require data pre-processing. In this notebook, we not only deep dive into how to deploy a tree-based ML model like XGBoost using the FIL backend in Triton on a SageMaker endpoint, but also cover how to implement a Python-based data pre-processing inference pipeline for your model using the ensemble feature in Triton. This allows us to send raw data from the client side and have both data pre-processing and model inference happen on the Triton SageMaker endpoint for optimal inference performance.
To run this notebook, please select the Python 3 (Data Science) kernel from the kernel dropdown menu.
Note: This notebook was tested with the Python 3 (Data Science) kernel on an Amazon SageMaker Studio instance of type ml.c5.xlarge.
Alternate Studio instance types: ml.c5.large, ml.c5.2xlarge
Forest Inference Library (FIL)
RAPIDS Forest Inference Library (FIL) is a library to provide high-performance inference for tree-based models. Here are some important FIL features:
- Supports XGBoost, LightGBM, cuML RandomForest, and Scikit Learn Random Forest
- No conversion needed for XGBoost and LightGBM. SKLearn or cuML pickle models need to be converted to Treelite's binary checkpoint format
- SKLearn Random Forest is supported for single-output regression and multi-class classification
- Both CPU and GPU are supported
Below we show a benchmark highlighting FIL's throughput performance compared to CPU XGBoost.

Triton FIL Backend
FIL is available as a backend in Triton with features to allow for serving XGBoost, LightGBM and RandomForest models both on CPU and GPU with high performance. Here are some important features of the FIL Backend:
- Shapley Value Support (GPU): GPU Shapley Values are supported for Model Explainability
- Categorical Feature Support: Models trained on categorical features are fully supported.
- CPU Optimizations: Optimized CPU mode offers faster execution than native XGBoost.
To learn more about FIL Backend's features please see the FAQ Notebook and Triton FIL Backend GitHub.
Triton Model Ensemble Feature
Triton Inference Server greatly simplifies the deployment of AI models at scale in production. It also provides a convenient way to build pre-processing and post-processing pipelines: the ensemble scheduler, which is responsible for pipelining the models participating in the inference process while ensuring efficiency and optimizing throughput. Using ensemble models avoids the overhead of transferring intermediate tensors and minimizes the number of requests that must be sent to Triton.

In this notebook we show how to use the ensemble feature to build a pipeline of data preprocessing followed by XGBoost model inference; you can extrapolate from it to add custom post-processing to the pipeline.
SageMaker MME supported Triton container runtime architecture with Triton FIL backend
Refer to this open source code for more information on how SageMaker implements multi-model endpoint contracts on top of open source Triton Inference server.

Set up Environment
We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton server. We also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.
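As a rough sketch, the setup could look like the code below. The Triton image account ID and version tag shown here are illustrative and vary by region and release, so check the SageMaker documentation for the correct image URI for your account.

```python
import boto3
import sagemaker

# SageMaker session, execution role, and default S3 bucket
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = boto3.Session().region_name

# Clients used later for creating the model/endpoint and for invoking it
sm_client = boto3.client("sagemaker", region_name=region)
runtime_client = boto3.client("sagemaker-runtime", region_name=region)

# NVIDIA Triton inference container image hosted in ECR.
# The account ID and tag below are examples; look up the values for your
# region and the Triton release you want to use.
triton_account_id = "301217895009"  # example account for us-west-2
triton_image_uri = (
    f"{triton_account_id}.dkr.ecr.{region}.amazonaws.com/"
    "sagemaker-tritonserver:22.10-py3"
)
```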
Set up pre-processing with Triton Python Backend
We will be using Triton's Python backend to perform some tabular data preprocessing (categorical encoding) at inference time for raw data requests coming into the server. To see the preprocessing that was done during training, take a look at the training notebook here.
The Python backend enables pre-processing, post-processing, and any other custom logic to be implemented in Python and served with Triton.
Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. We have already set up a model for Python data preprocessing, called preprocessing, in both the cpu_model_repository and the gpu_model_repository.

Triton has specific requirements for the model repository layout. Within the top-level model repository directory, each model has its own sub-directory containing its information. Each model directory in Triton must have at least one numeric sub-directory representing a version of the model; here that is 1, representing version 1 of our Python preprocessing model. Each model is executed by a specific backend, so each version sub-directory must contain the model artifact required by that backend. Here, we are using the Python backend, which requires the Python file you are serving to be called model.py and to implement certain functions. If we were using a PyTorch backend, a model.pt file would be required, and so on. For more details on naming conventions for model files please see the model files doc.
The model.py file we are using here implements all the tabular data preprocessing logic to convert raw data into features that can be fed into our XGBoost model.
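As a rough illustration, a Triton Python backend model.py follows the structure sketched below. The tensor names and the encoding step are placeholders for this write-up, not the actual logic of the file used in this example.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Python backend model that converts raw inputs into XGBoost features."""

    def initialize(self, args):
        # Load serialized preprocessing artifacts (e.g. label encoders) here.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # "RAW_INPUT" and "OUTPUT" are placeholder tensor names; they must
            # match the names declared in this model's config.pbtxt.
            raw = pb_utils.get_input_tensor_by_name(request, "RAW_INPUT").as_numpy()

            # ... categorical encoding / feature engineering would happen here ...
            features = raw.astype(np.float32)

            out_tensor = pb_utils.Tensor("OUTPUT", features)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass
```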
Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, please see the model configuration doc. Our config.pbtxt specifies the backend as python, lists all the raw-data input columns, and declares a preprocessed output consisting of 15 features. We also specify that this Python preprocessing model should run on the CPU.
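For illustration only, a trimmed-down preprocessing config.pbtxt could look like the snippet below; the real file declares every raw input column by name, and the tensor names here are placeholders matching the model.py sketch above.

```
name: "preprocessing"
backend: "python"
max_batch_size: 512
input [
  {
    name: "RAW_INPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]
instance_group [{ kind: KIND_CPU }]
```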
Create Conda Env for Preprocessing Dependencies
The Python backend in Triton requires us to use a conda environment for any additional dependencies. In this case we use the Python backend to preprocess the raw data before feeding it into the XGBoost model running in the FIL backend. Even though we originally used RAPIDS cuDF and cuML to do the data preprocessing, here we use Pandas and Scikit-learn as the preprocessing dependencies at inference time. We do this for three reasons.
- Firstly, to show how to create a conda environment for your dependencies and how to package it in the format expected by Triton's Python backend.
- Secondly, by running the preprocessing model in the Python backend on the CPU while XGBoost runs on the GPU in the FIL backend, we illustrate how each model in Triton's ensemble pipeline can run on a different framework backend as well as a different hardware configuration.
- Thirdly, it highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, Scikit-learn). For example, this shows how LabelEncoders created in cuML can be used in Scikit-learn and vice versa.
We follow the instructions from the Triton documentation for packaging the preprocessing dependencies (scikit-learn and pandas) as a conda env tar file for use in the Python backend. The bash script create_prep_env.sh creates the conda environment tar file, and then we move it into the preprocessing model directory.
After creating the tar file from the conda environment and placing it in the model folder, you need to tell the Python backend to use that environment for your model. We do this by including the lines below in the model's config.pbtxt file:
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/preprocessing_env.tar.gz"}
}
Here, $$TRITON_MODEL_DIRECTORY provides the environment path relative to the model folder in the model repository and resolves to $pwd/model_repository/preprocessing. Finally, preprocessing_env.tar.gz is the name we gave to our conda env file.
Set up Label Encoders
We also move the label encoders we serialized earlier into the preprocessing model folder so that we can use them to encode the raw categorical features at inference time.
Set up Tree-based ML Model for FIL Backend
Next, we set up the model directory for a tree-based ML model like XGBoost, which will use the FIL backend.
The expected layout for the model directory is similar to the one we showed above:

Here, fil is the name of the model. We could give it a different name, like xgboost, if we wanted to. 1 is the version sub-directory, which contains the model artifact; in this case it's the xgboost.json model that we saved at the end of the first notebook. Let's create this expected layout.
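A minimal sketch of creating that layout, assuming the directory and artifact names used in this example:

```python
import os
import shutil

# Create the version sub-directory for the FIL model and copy in the
# XGBoost artifact saved by the training notebook.
FIL_MODEL_DIR = "./model_repository/fil/1"
os.makedirs(FIL_MODEL_DIR, exist_ok=True)
shutil.copy("xgboost.json", os.path.join(FIL_MODEL_DIR, "xgboost.json"))
```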
Finally, we need a config.pbtxt file describing the model configuration for the tree-based ML model so that the FIL backend in Triton knows how to serve it.
Create Config File for FIL Backend Model
You can read about all generic Triton configuration options here and about configuration options specific to the FIL backend here, but we will focus on just a few of the most common and relevant options in this example. Below are general descriptions of these options:
- max_batch_size: The maximum batch size that can be passed to this model. In general, the only limit on the size of batches passed to a FIL backend is the memory available with which to process them.
- input: Options in this section tell Triton the number of features to expect for each input sample.
- output: Options in this section tell Triton how many output values there will be for each sample. If the "predict_proba" option (described further on) is set to true, then a probability value will be returned for each class. Otherwise, a single value will be returned indicating the class predicted for the given sample.
- instance_group: This determines how many instances of this model will be created and whether they will use the GPU or CPU.
- model_type: A string indicating what format the model is in ("xgboost_json" in this example, but "xgboost", "lightgbm", and "tl_checkpoint" are valid formats as well).
- predict_proba: If set to true, probability values will be returned for each class rather than just a class prediction.
- output_class: True for classification models, false for regression models.
- threshold: A score threshold for determining classification. When output_class is set to true, this must be provided, although it will not be used if predict_proba is also set to true.
- storage_type: In general, using "AUTO" for this setting should meet most use cases. If "AUTO" storage is selected, FIL will load the model using either a sparse or dense representation based on the approximate size of the model. In some cases, you may want to explicitly set this to "SPARSE" in order to reduce the memory footprint of large models.
Here we have 15 input features and 2 classes (FRAUD, NOT FRAUD) that we are classifying with our XGBoost model. Based on this information, let's set up the FIL backend configuration file for our tree-based model for serving on GPU.
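An illustrative config.pbtxt along these lines is sketched below. The input__0/output__0 tensor names follow the FIL backend's default naming convention, and the max_batch_size and threshold values shown here are examples rather than tuned settings.

```
name: "fil"
backend: "fil"
max_batch_size: 32768
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 15 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost_json" }
  },
  {
    key: "predict_proba"
    value: { string_value: "true" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  }
]
```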
Set up Inference Pipeline of Data Preprocessing Python Backend and FIL Backend using Ensemble
Now we are ready to set up the inference pipeline for data preprocessing and tree-based model inference using an ensemble model. An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Here we use the ensemble model to build a pipeline of Data Preprocessing in Python backend followed by XGBoost in FIL Backend.
The expected layout for the ensemble model directory is similar to the ones we showed above:

We created the ensemble model's config.pbtxt following the guidance in the ensemble doc. Importantly, we need to set up the ensemble scheduler in config.pbtxt, which specifies the dataflow between the models within the ensemble. The ensemble scheduler collects the output tensors of each step and provides them as input tensors to other steps according to the specification.
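A simplified sketch of such an ensemble config.pbtxt is shown below; the tensor names are placeholders and must match the names declared by the preprocessing and fil models.

```
name: "ensemble"
platform: "ensemble"
max_batch_size: 32768
input [
  { name: "RAW_INPUT", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "output__0", data_type: TYPE_FP32, dims: [ 2 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "RAW_INPUT", value: "RAW_INPUT" }
      output_map { key: "OUTPUT", value: "preprocessed_features" }
    },
    {
      model_name: "fil"
      model_version: -1
      input_map { key: "input__0", value: "preprocessed_features" }
      output_map { key: "output__0", value: "output__0" }
    }
  ]
}
```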
Package model repository and upload to S3
Finally, we end up with the following model repository directory structure, containing the Python preprocessing model and its dependencies, the XGBoost FIL model, and the model ensemble.

We will package this up as model.tar.gz for uploading it to S3.
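A rough sketch of the packaging and upload step, assuming the default SageMaker bucket and example directory/prefix names:

```python
import tarfile
import sagemaker

# Package the model repository into model.tar.gz. The archive must contain
# the model folders themselves (preprocessing/, fil/, ensemble/) at its root.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for name in ["preprocessing", "fil", "ensemble"]:
        tar.add(f"model_repository/{name}", arcname=name)

# Upload to the session's default bucket under an example prefix.
sess = sagemaker.Session()
model_uri = sess.upload_data(path="model.tar.gz", key_prefix="triton-fil-mme")
print(model_uri)
```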
Create and Upload the model package for CPU-based instance (optimized for CPU)
If you do not have access to the default bucket, you can upload the model tarball to the bucket and prefix of your choice using the following code:
model_uri="s3://<YourBucketName>/<YourPrefix>/model.tar.gz"
!aws s3 cp model.tar.gz "$model_uri"
Create and Upload the model package for GPU-based instance (optimized for GPU)

Create SageMaker Endpoint
We start off by creating a SageMaker model from the model repository we uploaded to S3 in the previous step.
In this step we also provide an additional environment variable, SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to S3. This variable is optional for a single model; for ensemble models, it has to be specified for Triton to start up in SageMaker.
Additionally, customers can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT for optimizing the thread counts.
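A minimal sketch of creating the SageMaker model with boto3 is shown below; the model data URL points to the S3 prefix holding the model.tar.gz packages, and the names here are placeholders reusing variables from the setup sketch above.

```python
from time import gmtime, strftime

sm_model_name = "triton-fil-mme-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_data_url = f"s3://{bucket}/triton-fil-mme/"  # prefix containing the tar.gz packages

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",  # enable multi-model endpoint behavior
    "Environment": {
        # Must match the ensemble folder name inside each model package.
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble",
    },
}

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print(create_model_response["ModelArn"])
```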
Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.
Using the above endpoint configuration we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.
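These two steps might look roughly like the sketch below; the instance type and count are examples.

```python
endpoint_config_name = sm_model_name + "-config"
endpoint_name = sm_model_name + "-endpoint"

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": sm_model_name,
            "InstanceType": "ml.g4dn.xlarge",  # example GPU instance type
            "InitialInstanceCount": 1,
        }
    ],
)

sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Block until the endpoint reaches the InService state.
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
```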
Run Inference
Once we have the endpoint running, we can use some sample raw data to do inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.
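A hedged sketch of such an invocation is below; the input tensor name, shape, datatype, content type, and TargetModel file name are assumptions and must match your ensemble config and the model packages you uploaded.

```python
import json

payload = {
    "inputs": [
        {
            "name": "RAW_INPUT",  # must match the ensemble's declared input tensor
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["<raw record formatted as the preprocessing model expects>"],
        }
    ]
}

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",  # plain JSON payload
    Body=json.dumps(payload),
    TargetModel="model_cpu.tar.gz",  # the specific MME model package to invoke
)
print(json.loads(response["Body"].read().decode("utf-8")))
```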
Call Model A (optimized for CPU)
Call Model B (optimized for GPU)
Binary + Json Payload
We can also use binary+json as the payload format to get better performance for the inference call. The specification of this format is provided here.
Note: With the binary+json format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header application/vnd.sagemaker-triton.binary+json;json-header-size={}.
Please note, this is different from using Inference-Header-Content-Length header on a stand-alone Triton server since custom headers are not allowed in SageMaker.
The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.
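A sketch of that flow using tritonclient's HTTP helpers is below; the tensor names, shapes, and TargetModel name are placeholders, and the helper methods reflect recent tritonclient releases, so check the version installed in your environment.

```python
import numpy as np
import tritonclient.http as httpclient

# Build the request tensors with the tritonclient helper classes.
data = np.random.rand(1, 15).astype(np.float32)  # placeholder features
inputs = [httpclient.InferInput("input__0", data.shape, "FP32")]
inputs[0].set_data_from_numpy(data, binary_data=True)
outputs = [httpclient.InferRequestedOutput("output__0", binary_data=True)]

# Serialize to the binary+json wire format and capture the JSON header size.
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

header_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=header_prefix + str(header_length),
    Body=request_body,
    TargetModel="model_gpu.tar.gz",  # placeholder MME model package name
)

# The response content type carries the size of the JSON portion of the reply.
response_header_length = response["ContentType"][len(header_prefix):]
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(response_header_length)
)
print(result.as_numpy("output__0"))
```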
Call Model A (optimized for CPU)
Call Model B (optimized for GPU)
Terminate endpoint and clean up artifacts
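A cleanup sketch, reusing the names created in the earlier steps:

```python
# Delete the endpoint, endpoint configuration, and model to stop incurring cost.
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)
```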
Conclusion
In this lab, we leveraged Triton Inference Server to create an ensemble that performs Python preprocessing followed by XGBoost inference, showing how fraud can be detected using Triton's Python and FIL backends. This example can be used as a guide to create your own ensembles leveraging the other backends Triton provides, solving a wide variety of use cases that require scale and performance while using hardware acceleration.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.