T5 PyTorch Python Backend
Serve Multiple DL models on GPU with Amazon SageMaker Multi-model endpoints (MME)
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Amazon SageMaker multi-model endpoints (MME) provide a scalable and cost-effective way to deploy large numbers of deep learning models. Previously, customers had limited options to deploy hundreds of deep learning models that need accelerated compute with GPUs. Now customers can deploy thousands of deep learning models behind one SageMaker endpoint. MME runs multiple models on a GPU, shares GPU instances behind an endpoint across multiple models, and dynamically loads/unloads models based on the incoming traffic. With this, customers can significantly save cost and achieve the best price performance.
In this notebook, we will walk you through how to use NVIDIA Triton Inference Server on an Amazon SageMaker MME with GPU support to deploy a T5 NLP model for translation.
Installs
Install the dependencies required to package the model and run inferences using Triton server, and update SageMaker, boto3, awscli, etc.
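A minimal sketch of the install cell, assuming a notebook environment where shell commands run with !; the tritonclient package is assumed here because it is used later for the binary + JSON payload, and no versions are pinned:

!pip install -qU pip
!pip install -qU sagemaker boto3 awscli
!pip install -qU nvidia-pyindex
!pip install -qU "tritonclient[http]"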
Imports and variables
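A minimal sketch of the setup cell; the variable names (role, bucket, prefix, sm_client, runtime_sm_client) are assumptions reused in the sketches below:

import boto3
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()        # IAM role used by the endpoint
bucket = sess.default_bucket()     # S3 bucket for model artifacts
prefix = "t5-pytorch-mme-gpu"      # assumed S3 key prefix
region = boto3.Session().region_name

sm_client = boto3.client("sagemaker")                   # control-plane client
runtime_sm_client = boto3.client("sagemaker-runtime")   # data-plane client for invocations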
Workflow Overview
This section presents an overview of the main steps for preparing a T5 PyTorch model (served using the Python backend) with Triton Inference Server.
1. Generate Model Artifacts
T5 PyTorch Model
For the T5-small HuggingFace PyTorch model, since we serve it using Triton's Python backend, we provide a Python script model.py that implements all the logic to initialize the T5 model and run inference for the translation task.
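The exact model.py shipped with this example is not reproduced here; the sketch below only illustrates the Triton Python backend interface it implements. The tensor names (INPUT0/OUTPUT0), generation parameters, and device placement are assumptions:

# model.py -- illustrative sketch of a Triton Python backend model, not the exact script used here
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import T5ForConditionalGeneration, T5Tokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer and model once per model instance
        self.tokenizer = T5Tokenizer.from_pretrained("t5-small")
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small").to("cuda")

    def execute(self, requests):
        responses = []
        for request in requests:
            # INPUT0 is assumed to be a BYTES tensor holding the input text
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            texts = [t.decode("utf-8") for t in in_tensor.as_numpy().flatten()]
            input_ids = self.tokenizer(texts, return_tensors="pt", padding=True).input_ids.to("cuda")
            output_ids = self.model.generate(input_ids, max_length=64)
            translations = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
            out_tensor = pb_utils.Tensor("OUTPUT0", np.array(translations, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses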
2. Build Model Repository
Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model we need to create a model directory containing the model artifact and define a config.pbtxt file that specifies the model configuration Triton uses to load and serve the model.
T5 Python Backend Model
Model repository structure for T5 Model.
t5_pytorch
├── 1
│ └── model.py
└── config.pbtxt
Next we set up the T5 PyTorch Python Backend Model in the model repository:
Create Conda Environment for Dependencies
To serve the HuggingFace T5 PyTorch model using Triton's Python backend, we need PyTorch and HuggingFace transformers as dependencies.
We follow the instructions from the Triton documentation for packaging dependencies to be used in the Python backend as a conda environment tar file. Running the bash script create_hf_env.sh creates the conda environment containing PyTorch and HuggingFace transformers, packages it as a tar file, and moves it into the t5_pytorch model directory. This can take a few minutes.
After creating the tar file from the conda environment and placing it in the model folder, you need to tell the Python backend to use that environment for your model. We do this by including the lines below in the model's config.pbtxt file:
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}
Here, $$TRITON_MODEL_DIRECTORY provides the environment path relative to the model folder in the model repository and resolves to $pwd/model_repository/t5_pytorch. Finally, hf_env.tar.gz is the name we gave to our conda environment file.
Now we are ready to define the config file for the t5_pytorch model served through Triton's Python backend:
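A config.pbtxt along these lines would be used. The tensor names, dims, and max_batch_size are assumptions (they must match whatever model.py expects); the EXECUTION_ENV_PATH parameter is the one discussed above:

name: "t5_pytorch"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}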
3. Package models and upload to S3
Next, we will package our model as *.tar.gz files for uploading to S3.
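A sketch of the packaging and upload step, assuming the model repository lives under model_repository/t5_pytorch and the S3 variables defined earlier:

import tarfile

# Package the t5_pytorch model directory (model.py, config.pbtxt, hf_env.tar.gz)
model_tar = "t5_pytorch.tar.gz"
with tarfile.open(model_tar, "w:gz") as tar:
    tar.add("model_repository/t5_pytorch", arcname="t5_pytorch")

# Upload the packaged model to the S3 prefix the MME will load models from
model_data_url = sess.upload_data(path=model_tar, bucket=bucket, key_prefix=prefix)
print(model_data_url)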
4. Create SageMaker Endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker multi-model endpoint.
Define the serving container
In the container definition, set ModelDataUrl to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. We configure the container with an image that supports deploying multi-model endpoints with GPU.
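A sketch of the container definition; the Triton image URI below is a placeholder (the account ID and tag differ per region and release), and mme_path is assumed to be the s3:// prefix that holds the *.tar.gz model packages:

# Placeholder URI -- look up the SageMaker Triton Inference Server image for your region and version
triton_image_uri = "<account_id>.dkr.ecr.{}.amazonaws.com/sagemaker-tritonserver:<tag>".format(region)

# S3 prefix (with trailing slash) containing all model *.tar.gz packages
mme_path = "s3://{}/{}/".format(bucket, prefix)

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": mme_path,
    "Mode": "MultiModel",
}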
Create a multi-model object
Once the image and data location are set, we create the model using create_model, specifying the ModelName and the container definition.
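A sketch of the create_model call (the model name is an assumed example):

model_name = "t5-pytorch-mme-gpu"  # assumed name

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=container,
)
print(create_model_response["ModelArn"])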
Define configuration for the multi-model endpoint
Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we deploy to an ml.g5.2xlarge NVIDIA GPU instance.
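A sketch of the endpoint configuration (the configuration name is an assumed example):

endpoint_config_name = "t5-pytorch-mme-gpu-config"  # assumed name

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",  # NVIDIA GPU instance used in this example
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        }
    ],
)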
Create Multi-Model Endpoint
Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.
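A sketch of the endpoint creation and wait step (the endpoint name is an assumed example):

endpoint_name = "t5-pytorch-mme-gpu-ep"  # assumed name

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Block until the endpoint reaches InService (or fails)
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])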
5. Run Inference
Once the endpoint is running, we can use some sample raw data to run inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.
Add utility methods for preparing JSON request payload
We'll use the following utility methods to convert our inference request for the T5 model into a JSON payload.
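A sketch of such a utility, assuming the model's input tensor is named INPUT0 and carries UTF-8 text as BYTES (matching the config sketch above):

def get_text_payload(texts):
    # Build a request in the KFServing/Triton v2 JSON inference format
    return {
        "inputs": [
            {
                "name": "INPUT0",           # assumed input tensor name
                "shape": [len(texts), 1],
                "datatype": "BYTES",
                "data": [[t] for t in texts],
            }
        ]
    }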
Invoke target model on Multi Model Endpoint
We can send an inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type.
T5 PyTorch Model
Next, we show some sample inference for translation on the T5 PyTorch model deployed on Triton's Python backend behind the SageMaker MME GPU endpoint.
Sample T5 Inference using JSON Payload
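A sketch of a JSON-payload invocation, assuming a translation prompt in T5's text-to-text format and that the model package was uploaded as t5_pytorch.tar.gz; the ContentType shown is an assumption:

import json

texts = ["translate English to German: The house is wonderful."]
payload = get_text_payload(texts)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",  # assumed content type for the Triton container
    Body=json.dumps(payload),
    TargetModel="t5_pytorch.tar.gz",          # model package to load/invoke on the MME
)

result = json.loads(response["Body"].read().decode("utf-8"))
print(result["outputs"][0]["data"])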
Sample T5 Inference using Binary + JSON Payload
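A sketch of the binary + JSON path using the tritonclient helpers to serialize the request body; the tensor names and shapes are assumptions, and the SageMaker-Triton content type is written as documented for the Triton container:

import numpy as np
import tritonclient.http as httpclient

texts = ["translate English to French: The weather is nice today."]
inputs = [httpclient.InferInput("INPUT0", [len(texts), 1], "BYTES")]
inputs[0].set_data_from_numpy(np.array([[t] for t in texts], dtype=object), binary_data=True)

# Serialize the request into Triton's binary + JSON wire format
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(inputs)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(header_length),
    Body=request_body,
    TargetModel="t5_pytorch.tar.gz",
)

# Parse the binary + JSON response using the header size echoed back in the response content type
header_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
resp_header_length = int(response["ContentType"][len(header_prefix):])
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=resp_header_length
)
print(result.as_numpy("OUTPUT0"))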
Terminate endpoint and clean up artifacts
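A sketch of the cleanup calls, reusing the names assumed above:

# Delete the endpoint, endpoint configuration, and model to stop incurring charges
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)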
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.