Falcon 7B DeepSpeed
Serve Falcon 7B model with Amazon SageMaker Hosting
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this example we walk through how to deploy and perform inference on the Falcon 7B model using the Large Model Inference (LMI) container provided by AWS, which uses DJL Serving and DeepSpeed. Falcon 7B is a causal decoder-only model, similar to the larger Falcon 40B model. We will deploy to an ml.g5.2xlarge instance for efficiency.
Setup
Installs the dependencies required to package the model and run inference with Amazon SageMaker, including updated versions of the SageMaker Python SDK and boto3.
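As a minimal sketch of that install cell (version pins omitted; the notebook's tested versions may differ):

```python
# Upgrade the SDKs used throughout this notebook
%pip install -U sagemaker boto3 --quiet
```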
Imports and variables
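The imports and session variables typically look like the following sketch; the S3 prefix is a placeholder:

```python
import boto3
import sagemaker

# SageMaker session, execution role, and region used by all later calls
sess = sagemaker.session.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

# S3 destination for the packaged model artifacts (prefix is illustrative)
bucket = sess.default_bucket()
s3_code_prefix = "large-model-lmi/falcon-7b/code"
```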
1. Create SageMaker-compatible model artifacts
To prepare our model for hosting on a SageMaker endpoint, we need to assemble a few artifacts for SageMaker and our container. We will place these files in a local folder, including a serving.properties file that defines parameters for the LMI container and a requirements.txt that lists the dependencies to install.
In the serving.properties file, define the engine to use and the model to host. Note the tensor_parallel_degree parameter, which is set to 1 in this scenario: since the entire model fits on a single GPU, we do not have to split it into multiple partitions. Here we will use an ml.g5.2xlarge instance, which provides one GPU. Be careful not to specify a tensor parallel degree larger than the number of GPUs the instance provides, or your deployment will fail.
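A representative serving.properties for this single-GPU setup might look like the cell below; the local folder name code_falcon7b is a placeholder, and the option values are illustrative:

```python
%%writefile code_falcon7b/serving.properties
engine=DeepSpeed
option.model_id=tiiuae/falcon-7b
option.tensor_parallel_degree=1
option.dtype=fp16
```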
2. Create a model.py with custom inference code
SageMaker allows you to bring your own script for inference. Here we create our model.py file with the appropriate code for the Falcon 7B model.
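A sketch of such a handler is below, assuming DJL Serving's Python API (djl_python's Input/Output) and a JSON payload with text and parameters fields; the DeepSpeed options are a starting point, since kernel-injection support varies by architecture:

```python
%%writefile code_falcon7b/model.py
import deepspeed
import torch
from djl_python import Input, Output
from transformers import pipeline

predictor = None


def get_pipeline(properties):
    # Read settings passed through serving.properties
    model_name = properties.get("model_id", "tiiuae/falcon-7b")
    tp_degree = int(properties.get("tensor_parallel_degree", 1))

    pipe = pipeline(
        "text-generation",
        model=model_name,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        device=0,  # single GPU, matching tensor_parallel_degree=1
    )
    # Wrap the underlying model with the DeepSpeed inference engine
    pipe.model = deepspeed.init_inference(
        pipe.model, mp_size=tp_degree, dtype=torch.float16
    )
    return pipe


def handle(inputs: Input) -> Output:
    global predictor
    if predictor is None:
        predictor = get_pipeline(inputs.get_properties())
    if inputs.is_empty():
        # DJL Serving sends an empty request at startup to trigger model loading
        return None
    data = inputs.get_as_json()
    result = predictor(data["text"], **data.get("parameters", {}))
    return Output().add_as_json(result)
```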
3. Create the tarball and upload it to an S3 location
Next, we package our artifacts as a *.tar.gz archive and upload it to S3 for SageMaker to use during deployment.
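Continuing the sketch, packaging and uploading might look like this (folder and prefix names carried over from above):

```python
import tarfile

# Bundle serving.properties, model.py, and requirements.txt into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code_falcon7b", arcname=".")

# Upload the archive for SageMaker to reference at deployment time
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"Model artifact uploaded to: {s3_code_artifact}")
```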
4. Define a serving container, SageMaker Model and SageMaker endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint.
Define the serving container
Here we define the container image used to serve the model for inference. We will use SageMaker's Large Model Inference (LMI) container with DeepSpeed.
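One way to resolve the image URI is through the SageMaker SDK's image_uris helper; the framework name and version below are illustrative, so check the SDK for the values current in your region:

```python
from sagemaker import image_uris

# Look up the LMI (DJL Serving + DeepSpeed) container image for this region
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(inference_image_uri)
```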
Create the SageMaker model, endpoint configuration, and endpoint
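A sketch of these three calls with boto3 (resource names are illustrative):

```python
from sagemaker.utils import name_from_base

sm_client = boto3.client("sagemaker")
model_name = name_from_base("falcon-7b-djl-ds")

# Model: the serving image plus the S3 location of the packaged artifacts
sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)

# Endpoint configuration: a single ml.g5.2xlarge instance (one GPU)
endpoint_config_name = f"{model_name}-config"
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            # Large models can take a while to download and load
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        }
    ],
)

# Endpoint: create_endpoint returns immediately; wait until it is in service
endpoint_name = f"{model_name}-endpoint"
sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
```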
Run Inference
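Once the endpoint is in service, it can be invoked through the SageMaker runtime; the payload schema below matches the model.py sketch above:

```python
import json

smr_client = boto3.client("sagemaker-runtime")

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(
        {"text": "What is Amazon SageMaker?", "parameters": {"max_new_tokens": 64}}
    ),
)
print(response["Body"].read().decode("utf-8"))
```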
Clean Up
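To avoid ongoing charges, delete the endpoint and its associated resources, for example:

```python
# Remove the endpoint, its configuration, and the model registration
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
```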
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.