T5 PyTorch Python Backend
Serve Multiple DL models on GPU with Amazon SageMaker Multi-model endpoints (MME)
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Amazon SageMaker multi-model endpoints (MME) provide a scalable and cost-effective way to deploy large numbers of deep learning models. Previously, customers had limited options to deploy hundreds of deep learning models that need accelerated compute with GPUs. Now customers can deploy thousands of deep learning models behind one SageMaker endpoint. MME runs multiple models on a GPU, shares GPU instances behind an endpoint across multiple models, and dynamically loads/unloads models based on the incoming traffic. With this, customers can significantly save cost and achieve the best price performance.
In this notebook, we will walk you through how to use NVIDIA Triton Inference Server on an Amazon SageMaker MME with GPU support to deploy a T5 NLP model for translation.
Installs
Install the dependencies required to package the model and run inferences using Triton server, and update SageMaker, boto3, awscli, etc.
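A minimal sketch of the install cell, assuming a notebook environment where shell commands run with !; the tritonclient package is assumed here because it is used later for the binary + JSON payload, and no versions are pinned:

!pip install -qU pip
!pip install -qU sagemaker boto3 awscli
!pip install -qU nvidia-pyindex
!pip install -qU "tritonclient[http]"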
Imports and variables
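A minimal sketch of the setup cell; the variable names (role, bucket, prefix, sm_client, runtime_sm_client) are assumptions reused in the sketches below:

import boto3
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()        # IAM role used by the endpoint
bucket = sess.default_bucket()     # S3 bucket for model artifacts
prefix = "t5-pytorch-mme-gpu"      # assumed S3 key prefix
region = boto3.Session().region_name

sm_client = boto3.client("sagemaker")                   # control-plane client
runtime_sm_client = boto3.client("sagemaker-runtime")   # data-plane client for invocations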
Workflow Overview
This section presents an overview of the main steps for preparing a T5 PyTorch model (served using the Python backend) with Triton Inference Server.
1. Generate Model Artifacts
T5 PyTorch Model
For the T5-small HuggingFace PyTorch model, since we serve it using Triton's Python backend, we provide a Python script model.py that implements all the logic to initialize the T5 model and run inference for the translation task.
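The exact model.py shipped with this example is not reproduced here; the sketch below only illustrates the Triton Python backend interface it implements. The tensor names (INPUT0/OUTPUT0), generation parameters, and device placement are assumptions:

# model.py -- illustrative sketch of a Triton Python backend model, not the exact script used here
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import T5ForConditionalGeneration, T5Tokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer and model once per model instance
        self.tokenizer = T5Tokenizer.from_pretrained("t5-small")
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small").to("cuda")

    def execute(self, requests):
        responses = []
        for request in requests:
            # INPUT0 is assumed to be a BYTES tensor holding the input text
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            texts = [t.decode("utf-8") for t in in_tensor.as_numpy().flatten()]
            input_ids = self.tokenizer(texts, return_tensors="pt", padding=True).input_ids.to("cuda")
            output_ids = self.model.generate(input_ids, max_length=64)
            translations = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
            out_tensor = pb_utils.Tensor("OUTPUT0", np.array(translations, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses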
2. Build Model Repository
Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. For each model we need to create a model directory containing the model artifact and define a config.pbtxt file that specifies the model configuration Triton uses to load and serve the model.
T5 Python Backend Model
Model repository structure for T5 Model.
t5_pytorch
├── 1
│ └── model.py
└── config.pbtxt
Next we set up the T5 PyTorch Python Backend Model in the model repository:
Create Conda Environment for Dependencies
To serve the HuggingFace T5 PyTorch model using Triton's Python backend, we need PyTorch and HuggingFace transformers as dependencies.
We follow the instructions from the Triton documentation for packaging dependencies to be used in the Python backend as a conda environment tar file. Running the bash script create_hf_env.sh creates the conda environment containing PyTorch and HuggingFace transformers, packages it as a tar file, and moves it into the t5_pytorch model directory. This can take a few minutes.
After creating the tar file from the conda environment and placing it in the model folder, you need to tell the Python backend to use that environment for your model. We do this by including the lines below in the model's config.pbtxt file:
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}
Here, $$TRITON_MODEL_DIRECTORY provides the environment path relative to the model folder in the model repository and resolves to $pwd/model_repository/t5_pytorch. Finally, hf_env.tar.gz is the name we gave to our conda environment file.
Now we are ready to define the config file for the t5_pytorch model served through Triton's Python backend:
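A config.pbtxt along these lines would be used. The tensor names, dims, and max_batch_size are assumptions (they must match whatever model.py expects); the EXECUTION_ENV_PATH parameter is the one discussed above:

name: "t5_pytorch"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}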
3. Package models and upload to S3
Next, we will package our model as *.tar.gz files for uploading to S3.
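A sketch of the packaging and upload step, assuming the model repository lives under model_repository/t5_pytorch and the S3 variables defined earlier:

import tarfile

# Package the t5_pytorch model directory (model.py, config.pbtxt, hf_env.tar.gz)
model_tar = "t5_pytorch.tar.gz"
with tarfile.open(model_tar, "w:gz") as tar:
    tar.add("model_repository/t5_pytorch", arcname="t5_pytorch")

# Upload the packaged model to the S3 prefix the MME will load models from
model_data_url = sess.upload_data(path=model_tar, bucket=bucket, key_prefix=prefix)
print(model_data_url)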
4. Create SageMaker Endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker multi-model endpoint.
Define the serving container
In the container definition, set ModelDataUrl to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. Set Mode to MultiModel to indicate that SageMaker should create the endpoint with MME container specifications. We configure the container with an image that supports deploying multi-model endpoints with GPU.
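A sketch of the container definition; the Triton image URI below is a placeholder (the account ID and tag differ per region and release), and mme_path is assumed to be the s3:// prefix that holds the *.tar.gz model packages:

# Placeholder URI -- look up the SageMaker Triton Inference Server image for your region and version
triton_image_uri = "<account_id>.dkr.ecr.{}.amazonaws.com/sagemaker-tritonserver:<tag>".format(region)

# S3 prefix (with trailing slash) containing all model *.tar.gz packages
mme_path = "s3://{}/{}/".format(bucket, prefix)

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": mme_path,
    "Mode": "MultiModel",
}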
Create a multi-model object
Once the image and data location are set, we create the model using create_model, specifying the ModelName and the container definition.
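A sketch of the create_model call (the model name is an assumed example):

model_name = "t5-pytorch-mme-gpu"  # assumed name

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=container,
)
print(create_model_response["ModelArn"])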
Define configuration for the multi-model endpoint
Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint. Here we deploy to an ml.g5.2xlarge NVIDIA GPU instance.
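A sketch of the endpoint configuration (the configuration name is an assumed example):

endpoint_config_name = "t5-pytorch-mme-gpu-config"  # assumed name

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",  # NVIDIA GPU instance used in this example
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        }
    ],
)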
Create Multi-Model Endpoint
Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.
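A sketch of the endpoint creation and wait step (the endpoint name is an assumed example):

endpoint_name = "t5-pytorch-mme-gpu-ep"  # assumed name

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Block until the endpoint reaches InService (or fails)
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])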
5. Run Inference
Once the endpoint is running, we can use some sample raw data to run inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.
Add utility methods for preparing JSON request payload
We'll use the following utility methods to convert our inference request for the T5 model into a JSON payload.
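A sketch of such a utility, assuming the model's input tensor is named INPUT0 and carries UTF-8 text as BYTES (matching the config sketch above):

def get_text_payload(texts):
    # Build a request in the KFServing/Triton v2 JSON inference format
    return {
        "inputs": [
            {
                "name": "INPUT0",           # assumed input tensor name
                "shape": [len(texts), 1],
                "datatype": "BYTES",
                "data": [[t] for t in texts],
            }
        ]
    }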
Invoke target model on Multi Model Endpoint
We can send an inference request to the multi-model endpoint using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type.
T5 PyTorch Model
Next, we show some sample inference for translation on the T5 PyTorch model deployed on Triton's Python backend behind the SageMaker MME GPU endpoint.
Sample T5 Inference using JSON Payload
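A sketch of a JSON-payload invocation, assuming a translation prompt in T5's text-to-text format and that the model package was uploaded as t5_pytorch.tar.gz; the ContentType shown is an assumption:

import json

texts = ["translate English to German: The house is wonderful."]
payload = get_text_payload(texts)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",  # assumed content type for the Triton container
    Body=json.dumps(payload),
    TargetModel="t5_pytorch.tar.gz",          # model package to load/invoke on the MME
)

result = json.loads(response["Body"].read().decode("utf-8"))
print(result["outputs"][0]["data"])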
Sample T5 Inference using Binary + JSON Payload
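A sketch of the binary + JSON path using the tritonclient helpers to serialize the request body; the tensor names and shapes are assumptions, and the SageMaker-Triton content type is written as documented for the Triton container:

import numpy as np
import tritonclient.http as httpclient

texts = ["translate English to French: The weather is nice today."]
inputs = [httpclient.InferInput("INPUT0", [len(texts), 1], "BYTES")]
inputs[0].set_data_from_numpy(np.array([[t] for t in texts], dtype=object), binary_data=True)

# Serialize the request into Triton's binary + JSON wire format
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(inputs)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(header_length),
    Body=request_body,
    TargetModel="t5_pytorch.tar.gz",
)

# Parse the binary + JSON response using the header size echoed back in the response content type
header_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
resp_header_length = int(response["ContentType"][len(header_prefix):])
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=resp_header_length
)
print(result.as_numpy("OUTPUT0"))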
Terminate endpoint and clean up artifacts
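A sketch of the cleanup calls, reusing the names assumed above:

# Delete the endpoint, endpoint configuration, and model to stop incurring charges
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)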
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.