Falcon 40B MPI
Deploy Falcon 40B on Amazon SageMaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this notebook, we use the Large Model Inference (LMI) container from SageMaker Deep Learning Containers to host Falcon 40B on Amazon SageMaker.
We'll also see which configuration parameters can be used to optimize the endpoint for throughput and latency. We will deploy on an ml.g5.12xlarge instance for efficiency.
Import the relevant libraries and configure several global variables using boto3.
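A minimal setup sketch, assuming the notebook runs with a SageMaker execution role; the session, role, and client names below are conventions reused in the later examples:

```python
import boto3
import sagemaker

# Session, role, and region used by all subsequent calls.
sess = sagemaker.Session()
role = sagemaker.get_execution_role()   # IAM role with SageMaker permissions
region = sess.boto_region_name
bucket = sess.default_bucket()          # bucket used to stage the model artifact

sm_client = boto3.client("sagemaker", region_name=region)            # control plane
smr_client = boto3.client("sagemaker-runtime", region_name=region)   # invocations
```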
Step 1: Prepare the model artifacts
The LMI container expects the following artifacts for hosting the model
- serving.properties (required): Defines the model server settings and configurations.
- model.py (optional): A Python script that defines the inference logic.
- requirements.txt (optional): Any additional pip wheels that need to be installed.
SageMaker expects the model artifacts in a tarball with the following structure -
code
├── serving.properties
├── model.py
└── requirements.txt
In this notebook, we'll only provide a serving.properties file. By default, the container runs the huggingface.py module from the djl python repository as the entry point code.
Create the serving.properties
This is a configuration file that tells DJL Serving which model parallelization and inference optimization techniques you would like to use. Depending on your needs, you can set the appropriate configuration.
Here is a list of the settings we use in this configuration file (a sample file is sketched after the list):
- option.model_id: Used to download the model from Hugging Face or an S3 bucket.
- option.tensor_parallel_degree: Set to the number of GPU devices over which to partition the model.
- option.max_rolling_batch_size: The maximum batch size for rolling/iteration-level batching. Limits the number of concurrent requests.
- option.rolling_batch: Selects a rolling batch strategy. auto will make the handler choose the strategy based on the provided configuration. scheduler is a native rolling batch strategy supported for a single GPU. lmi-dist and vllm support multi-GPU rolling/iteration-level batching.
- option.paged_attention: Enabling this preallocates more GPU memory for caching. This is only supported when option.rolling_batch=lmi-dist or option.rolling_batch=auto.
- option.max_rolling_batch_prefill_tokens: Only supported for option.rolling_batch=lmi-dist. Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. Use this to tune for your workload.
- engine: The runtime engine of the code. MPI below refers to the parallel processing framework. It is used by engines like DeepSpeed and FasterTransformer as well.
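As an illustration, a serving.properties along these lines could be written from a notebook cell. The tensor parallel degree, batch-size values, and the {{s3url}} placeholder (filled in later) are assumptions to adjust for your workload:

```
%%writefile serving.properties
engine=MPI
option.model_id={{s3url}}
option.tensor_parallel_degree=4
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=True
```

With 4 GPUs on an ml.g5.12xlarge, a tensor parallel degree of 4 partitions the model across all of them; lower it if you host multiple copies of a smaller model on one instance.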
Define a variable to store the S3 location of the model weights.
Plug the appropriate model location into the serving.properties file. For these publicly hosted model weights, the S3 URL depends on the region in which the notebook is executed.
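A sketch of filling in the placeholder, assuming the serving.properties above uses a Jinja-style {{s3url}} token; the S3 path shown is a placeholder, not the actual public bucket for the Falcon 40B weights:

```python
from pathlib import Path
import jinja2

# Placeholder -- replace with the regional URL of the published Falcon 40B
# weights, or with your own copy of the model in S3.
pretrained_model_location = f"s3://YOUR-WEIGHTS-BUCKET-{region}/falcon-40b/"

# Render {{s3url}} in serving.properties with the actual location.
template = jinja2.Template(Path("serving.properties").read_text())
Path("serving.properties").write_text(
    template.render(s3url=pretrained_model_location)
)
```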
Create a model.tar.gz with the model artifacts
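A packaging sketch that mirrors the directory layout shown above; the code/ directory name comes from that layout, and everything else is local to the notebook:

```python
import os
import shutil
import tarfile

# Place serving.properties under code/ and compress the directory.
os.makedirs("code", exist_ok=True)
shutil.copyfile("serving.properties", "code/serving.properties")

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code")  # the tarball now contains code/serving.properties
```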
Step 2: Create the SageMaker endpoint
Define the SageMaker inference container image URI to use for model inference.
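One way to look this up is through the SageMaker SDK's image_uris helper; the framework key and container version below are assumptions, so check the SageMaker LMI release notes for the image that matches your region and SDK version:

```python
from sagemaker import image_uris

# Retrieve the LMI (DJL) container image for this region.
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",  # LMI container family
    region=region,
    version="0.23.0",           # assumed container version
)
print(inference_image_uri)
```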
Upload the artifact to S3 and create a SageMaker model.
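A sketch of the upload and model registration, reusing the session, role, and image URI defined earlier; the S3 prefix and model name are illustrative:

```python
from sagemaker.utils import name_from_base

# Stage the tarball in S3.
s3_code_prefix = "large-model-lmi/code"  # illustrative prefix
code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

# Register a SageMaker model that points at the LMI image and the artifact.
model_name = name_from_base("falcon-40b-lmi")
sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
    },
)
```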
This step can take ~10 minutes or longer, so please be patient.
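A sketch of the endpoint configuration and a blocking wait for the deployment; the variant name and timeouts are assumptions:

```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Single ml.g5.12xlarge instance, with generous startup timeouts because the
# 40B weights take a while to download and load.
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        }
    ],
)

sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Poll until the endpoint is InService (or creation fails).
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
```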
Step 3: Invoke the Endpoint
Generation
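A sample generation request, assuming the LMI container's default handler, which accepts an inputs string plus a parameters dictionary; the prompt and generation parameters are illustrative:

```python
import json

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(
        {
            "inputs": "Building a website can be done in 10 simple steps:",
            "parameters": {"max_new_tokens": 128, "do_sample": True},
        }
    ),
)
print(response["Body"].read().decode("utf-8"))
```

The translation, classification, question-answering, and summarization examples below follow the same pattern; only the prompt text (and, if desired, the generation parameters) changes.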
Translation
Classification
Question answering
Summarization
Clean up the environment
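Deleting the endpoint, its configuration, and the model stops further charges; a sketch using the client and names from the earlier steps:

```python
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
```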
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.