Deploy a CLIP model on SageMaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this notebook, we will deploy a CLIP model to a SageMaker endpoint using the DJLServing container image.
Setup
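A minimal setup cell might look like the following sketch. It assumes the notebook runs with a SageMaker execution role and that the SageMaker Python SDK is available; the default bucket is used only as a convenient place for artifacts.

```python
# Install or upgrade the libraries used in this notebook (uncomment on first run)
# %pip install -U sagemaker boto3 huggingface_hub

import sagemaker

# SageMaker session, execution role, region, and a default S3 bucket for artifacts
sess = sagemaker.session.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
bucket = sess.default_bucket()
print(f"Region: {region}\nBucket: {bucket}")
```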
Prepare inference script and container image
In this notebook, we will store the model artifacts on S3 and load the model directly from S3.
The Large Model Inference (LMI) container uses s5cmd to download data from S3, which significantly speeds up model loading during deployment. Therefore, we recommend loading the model from S3 by following the section below to download the model from Hugging Face and upload the model artifacts to S3. Note that if you choose to load the model directly from Hugging Face during model deployment, you can prepare the model tarball file and upload it to S3 without downloading the model locally.
Download the model from Hugging Face and upload the model artifacts on Amazon S3
If you intend to download your own copy of the model and upload it to an S3 location in your AWS account, please follow the steps below; otherwise, you can skip to the next step.
Please make sure the model is downloaded correctly by checking that the files exist in the newly created folder blip2-model/models--Salesforce--<model-name>/snapshots/... before running the cell below.
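A sketch of the download-and-upload step is shown below. The model id, the local cache folder name, and the S3 prefix are assumptions; substitute the checkpoint you actually want to deploy and keep the local folder consistent with the path you check in the previous step. It assumes huggingface_hub is installed and reuses `bucket` from the setup cell.

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from sagemaker.s3 import S3Uploader

# Hypothetical model id -- substitute the checkpoint you actually want to deploy
model_id = "openai/clip-vit-base-patch32"
local_cache_dir = Path("clip-model")  # local Hugging Face cache folder (assumed name)

# Download the full model repository; the returned path points at the snapshot folder
snapshot_path = snapshot_download(repo_id=model_id, cache_dir=local_cache_dir)
print(f"Model downloaded to: {snapshot_path}")

# Upload the downloaded snapshot to S3 so the LMI container can fetch it with s5cmd
# (`bucket` comes from the setup cell)
pretrained_model_s3_uri = S3Uploader.upload(
    local_path=snapshot_path,
    desired_s3_uri=f"s3://{bucket}/clip/model",
)
print(f"Model artifacts uploaded to: {pretrained_model_s3_uri}")
```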
SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions.
However, in this notebook, we demonstrate how to deploy a model with custom inference code.
SageMaker needs the model artifacts to be in a tarball format. In this example, we provide the following files: serving.properties, model.py, and requirements.txt.
- serving.properties is the configuration file that tells DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your needs, you can set the appropriate configuration. For more details on the configuration options and an exhaustive list, you can refer to the documentation.
- model.py is the script that handles any requests for serving.
- requirements.txt is the text file containing any additional pip packages that need to be installed.
If you want to download the model from huggingface.co, you can set option.model_id to the model ID of a pretrained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model ID to download the corresponding model repository from huggingface.co. If you set model_id to an S3 URL, DJL Serving will download the model artifacts from S3 and swap the model_id to the actual location of the model artifacts. In your script, you can point to this value to load the pretrained model.
option.tensor_parallel_degree: Set this to the number of GPU devices over which the model needs to be partitioned. This parameter also controls the number of workers per model that will be started when DJL Serving runs. For example, if we have an 8-GPU machine and we create 8 partitions, then we will have 1 worker per model to serve requests.
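As an illustration, a serving.properties along these lines could be written from the notebook. The engine, tensor parallel degree, and code folder name are assumptions; option.model_id points at the S3 prefix from the upload step (or at a Hugging Face model ID if you skipped the local download).

```python
import os

# Local folder that will hold the inference code and configuration before packaging
code_dir = "clip_src"
os.makedirs(code_dir, exist_ok=True)

# Example serving.properties -- the engine, tensor parallel degree, and model location
# are assumptions; adjust them to your instance type and model location
serving_properties = (
    "engine=Python\n"
    f"option.model_id={pretrained_model_s3_uri}\n"
    "option.tensor_parallel_degree=1\n"
)
with open(os.path.join(code_dir, "serving.properties"), "w") as f:
    f.write(serving_properties)

print(serving_properties)
```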
Prepare the model tarball file and upload to S3
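One way to package and upload the code tarball is sketched below. It assumes model.py and requirements.txt are already present in the code folder created above; the tarball name and S3 prefix are assumptions.

```python
import os
import tarfile

from sagemaker.s3 import S3Uploader

# Package serving.properties, model.py, and requirements.txt into a tarball
# (model.py and requirements.txt are assumed to already be in code_dir)
tarball_name = "model.tar.gz"
with tarfile.open(tarball_name, "w:gz") as tar:
    for fname in ("serving.properties", "model.py", "requirements.txt"):
        tar.add(os.path.join(code_dir, fname), arcname=fname)

# Upload the code tarball to S3; this URI becomes model_data at deployment time
code_artifact_s3_uri = S3Uploader.upload(
    local_path=tarball_name,
    desired_s3_uri=f"s3://{bucket}/clip/code",
)
print(f"Code tarball uploaded to: {code_artifact_s3_uri}")
```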
Deploy model
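A deployment sketch is shown below. The DJL framework/version pair passed to image_uris.retrieve, the instance type, and the naming scheme are assumptions; pick an LMI release and GPU instance that match your region and model size.

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.utils import name_from_base

# Resolve a DJL LMI container image for this region; the framework/version pair is an
# assumption -- pick a release that is available in your region
image_uri = sagemaker.image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)

model_name = name_from_base("clip-djl")
endpoint_name = model_name

# model_data holds only the inference code tarball; the weights are pulled from
# option.model_id inside serving.properties
model = Model(
    image_uri=image_uri,
    model_data=code_artifact_s3_uri,
    role=role,
    name=model_name,
    sagemaker_session=sess,
)

# Deploy to a GPU instance; the instance type is an assumption, size it to your model
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=endpoint_name,
)
```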
Test Inference Endpoint
Image captioning
Let's ask our model to classify the objects appearing in the above image
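A minimal invocation sketch follows. The JSON payload (a base64-encoded image plus a list of candidate labels) and the local image file name are assumptions; they must match the contract implemented by your custom model.py handler.

```python
import base64
import json

import boto3

# Invoke the endpoint through the SageMaker runtime client; the payload schema below
# is an assumed contract that must match what the custom model.py handler expects
runtime = boto3.client("sagemaker-runtime", region_name=region)

with open("cats.jpg", "rb") as f:  # placeholder file name for the image shown above
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "image": image_b64,
    "candidate_labels": ["cats", "dogs", "birds"],
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```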
['cats']
Clean up
Uncomment the cell below to delete the endpoint and the model when you finish the experiment.
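A cleanup sketch is shown below. It assumes the endpoint configuration shares the endpoint's name, which is the SageMaker SDK default when deploying with an explicit endpoint_name.

```python
import boto3

sm_client = boto3.client("sagemaker", region_name=region)

# Uncomment to tear down the resources created above; the endpoint config name is
# assumed to match the endpoint name
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
# sm_client.delete_model(ModelName=model_name)
```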
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.