Falcon 40B MPI
Deploy Falcon 40B on Amazon SageMaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
In this notebook, we use the Large Model Inference (LMI) container from SageMaker Deep Learning Containers to host Falcon 40B on Amazon SageMaker.
We'll also see which configuration parameters can be used to optimize the endpoint for throughput and latency. We will deploy on an ml.g5.12xlarge instance for efficiency.
Import the relevant libraries and configure several global variables using boto3.
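A minimal setup sketch, assuming the notebook runs with a SageMaker execution role; the session, role, and client names below are conventions reused in the later examples:

```python
import boto3
import sagemaker

# Session, role, and region used by all subsequent calls.
sess = sagemaker.Session()
role = sagemaker.get_execution_role()   # IAM role with SageMaker permissions
region = sess.boto_region_name
bucket = sess.default_bucket()          # bucket used to stage the model artifact

sm_client = boto3.client("sagemaker", region_name=region)            # control plane
smr_client = boto3.client("sagemaker-runtime", region_name=region)   # invocations
```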
Step 1: Prepare the model artifacts
The LMI container expects the following artifacts for hosting the model
- serving.properties (required): Defines the model server settings and configurations.
- model.py (optional): A Python script that defines the inference logic.
- requirements.txt (optional): Any additional pip wheels that need to be installed.
SageMaker expects the model artifacts in a tarball with the following structure -
code
├── serving.properties
├── model.py
└── requirements.txt
In this notebook, we'll only provide a serving.properties file. By default, the container runs the huggingface.py module from the djl python repository as the entry point code.
Create the serving.properties
This is a configuration file that tells DJL Serving which model parallelization and inference optimization techniques you would like to use. Depending on your needs, you can set the appropriate configuration.
Here is a list of the settings we use in this configuration file (a sample file is sketched after the list):
- option.model_id: Used to download the model from Hugging Face or an S3 bucket.
- option.tensor_parallel_degree: Set to the number of GPU devices over which to partition the model.
- option.max_rolling_batch_size: The maximum batch size for rolling/iteration-level batching. Limits the number of concurrent requests.
- option.rolling_batch: Selects a rolling batch strategy. auto will make the handler choose the strategy based on the provided configuration. scheduler is a native rolling batch strategy supported for a single GPU. lmi-dist and vllm support multi-GPU rolling/iteration-level batching.
- option.paged_attention: Enabling this preallocates more GPU memory for caching. This is only supported when option.rolling_batch=lmi-dist or option.rolling_batch=auto.
- option.max_rolling_batch_prefill_tokens: Only supported for option.rolling_batch=lmi-dist. Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. Use this to tune for your workload.
- engine: The runtime engine of the code. MPI below refers to the parallel processing framework. It is used by engines like DeepSpeed and FasterTransformer as well.
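As an illustration, a serving.properties along these lines could be written from a notebook cell. The tensor parallel degree, batch-size values, and the {{s3url}} placeholder (filled in later) are assumptions to adjust for your workload:

```
%%writefile serving.properties
engine=MPI
option.model_id={{s3url}}
option.tensor_parallel_degree=4
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=True
```

With 4 GPUs on an ml.g5.12xlarge, a tensor parallel degree of 4 partitions the model across all of them; lower it if you host multiple copies of a smaller model on one instance.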
Define a variable to store the S3 location of the model weights.
Plug the appropriate model location into the serving.properties file. For these publicly hosted model weights, the S3 URL depends on the region in which the notebook is executed.
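A sketch of filling in the placeholder, assuming the serving.properties above uses a Jinja-style {{s3url}} token; the S3 path shown is a placeholder, not the actual public bucket for the Falcon 40B weights:

```python
from pathlib import Path
import jinja2

# Placeholder -- replace with the regional URL of the published Falcon 40B
# weights, or with your own copy of the model in S3.
pretrained_model_location = f"s3://YOUR-WEIGHTS-BUCKET-{region}/falcon-40b/"

# Render {{s3url}} in serving.properties with the actual location.
template = jinja2.Template(Path("serving.properties").read_text())
Path("serving.properties").write_text(
    template.render(s3url=pretrained_model_location)
)
```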
Create a model.tar.gz with the model artifacts
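A packaging sketch that mirrors the directory layout shown above; the code/ directory name comes from that layout, and everything else is local to the notebook:

```python
import os
import shutil
import tarfile

# Place serving.properties under code/ and compress the directory.
os.makedirs("code", exist_ok=True)
shutil.copyfile("serving.properties", "code/serving.properties")

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code")  # the tarball now contains code/serving.properties
```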
Step 2: Create the SageMaker endpoint
Define the SageMaker inference container image URI to use for model inference.
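One way to look this up is through the SageMaker SDK's image_uris helper; the framework key and container version below are assumptions, so check the SageMaker LMI release notes for the image that matches your region and SDK version:

```python
from sagemaker import image_uris

# Retrieve the LMI (DJL) container image for this region.
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",  # LMI container family
    region=region,
    version="0.23.0",           # assumed container version
)
print(inference_image_uri)
```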
Upload the artifact to S3 and create a SageMaker model.
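A sketch of the upload and model registration, reusing the session, role, and image URI defined earlier; the S3 prefix and model name are illustrative:

```python
from sagemaker.utils import name_from_base

# Stage the tarball in S3.
s3_code_prefix = "large-model-lmi/code"  # illustrative prefix
code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

# Register a SageMaker model that points at the LMI image and the artifact.
model_name = name_from_base("falcon-40b-lmi")
sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
    },
)
```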
This step can take ~10 minutes or longer, so please be patient.
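A sketch of the endpoint configuration and a blocking wait for the deployment; the variant name and timeouts are assumptions:

```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Single ml.g5.12xlarge instance, with generous startup timeouts because the
# 40B weights take a while to download and load.
sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        }
    ],
)

sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

# Poll until the endpoint is InService (or creation fails).
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
```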
Step 3: Invoke the Endpoint
Generation
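A sample generation request, assuming the LMI container's default handler, which accepts an inputs string plus a parameters dictionary; the prompt and generation parameters are illustrative:

```python
import json

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(
        {
            "inputs": "Building a website can be done in 10 simple steps:",
            "parameters": {"max_new_tokens": 128, "do_sample": True},
        }
    ),
)
print(response["Body"].read().decode("utf-8"))
```

The translation, classification, question-answering, and summarization examples below follow the same pattern; only the prompt text (and, if desired, the generation parameters) changes.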
Translation
Classification
Question answering
Summarization
Clean up the environment
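Deleting the endpoint, its configuration, and the model stops further charges; a sketch using the client and names from the earlier steps:

```python
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
```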
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.