Host Llama2-70B on Amazon SageMaker using LMI V7 container

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

In this notebook, we deploy the Llama-2-70B model across the GPUs of an ml.p4d.24xlarge instance.

Import the relevant libraries and configure several global variables using boto3

[ ]
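
A minimal sketch of what this setup cell might look like, assuming the `sagemaker` and `boto3` packages are installed and the notebook runs under a SageMaker execution role. The helper names `s3_uri` and `load_aws_context`, and the default `prefix` value, are hypothetical and introduced only for illustration.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key fragments into an s3:// URI."""
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def load_aws_context(prefix="llama2-70b-lmi"):  # prefix is an assumed, illustrative value
    """Collect the session-wide values that the later steps rely on."""
    # Imported lazily so the pure helper above works without the AWS SDKs.
    import boto3
    import sagemaker

    session = sagemaker.Session()
    return {
        "role": sagemaker.get_execution_role(),  # IAM role used by the endpoint
        "region": session.boto_region_name,
        "bucket": session.default_bucket(),  # default SageMaker bucket
        "prefix": prefix,
        "sm_client": boto3.client("sagemaker"),  # control-plane calls
        "smr_client": boto3.client("sagemaker-runtime"),  # inference calls
    }
```

Calling `load_aws_context()` requires valid AWS credentials; `s3_uri` can be used on its own to build consistent S3 locations for the later steps.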

Download the model artifacts and upload to S3

We recommend first saving the model to an S3 location and providing the S3 URL in the serving.properties file. This allows faster download times when the endpoint starts.

[ ]
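
One way to sketch this step, assuming the weights are pulled from the Hugging Face Hub with `huggingface_hub.snapshot_download` and pushed to S3 with `sagemaker.s3.S3Uploader`. The file patterns, the helper names, and the idea of passing a Hub access token are assumptions; Llama 2 repositories are gated, so a token with access is typically needed.

```python
from fnmatch import fnmatch

# Assumption: safetensors weights plus config and tokenizer files are sufficient.
ALLOW_PATTERNS = ["*.json", "*.safetensors", "*.model"]


def is_wanted(filename):
    """Return True if a repository file matches one of the allowed patterns."""
    return any(fnmatch(filename, pattern) for pattern in ALLOW_PATTERNS)


def download_and_upload(model_id, local_dir, s3_destination, hf_token=None):
    """Download a model snapshot locally, then copy it to S3."""
    # Imported lazily: both packages are optional for the pure helper above.
    from huggingface_hub import snapshot_download
    from sagemaker.s3 import S3Uploader

    local_path = snapshot_download(
        repo_id=model_id,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,  # skip files the container does not need
        token=hf_token,
    )
    # Returns the S3 URI that serving.properties should point to.
    return S3Uploader.upload(local_path=local_path, desired_s3_uri=s3_destination)
```

Restricting the download with `allow_patterns` avoids copying duplicate weight formats, which matters for a model of this size.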

Create serving.properties file, upload model to S3 and provide the inference container

SageMaker Large Model Inference containers can be used to host models without any additional inference code. You also have the option to provide your inference script if you need any custom pre-processing of the input data or postprocessing of the model's predictions.

SageMaker requires the model artifacts to be packaged as a tarball. In this example, the tarball contains a single file: serving.properties.

serving.properties is the configuration file that can be used to configure the model server.

[ ]

Create serving.properties

This configuration file tells DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your needs, you can set the appropriate configuration.

This configuration file uses the following settings:

engine: The engine for DJL to use.
option.model_id: This can be the S3 uri of the pre-trained model or the model id of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models).
option.tensor_parallel_degree: The number of GPU devices over which Accelerate partitions the model. This parameter also determines how many workers per model DJL Serving starts, namely the number of GPUs divided by the tensor parallel degree. For example, on an 8-GPU machine with 8 partitions, there is 1 worker per model to serve requests.
option.rolling_batch: Enables rolling batching of inputs.
option.max_rolling_batch_size: Sets the maximum batch size used for rolling batch.
option.model_loading_timeout: Sets the timeout value for downloading and loading the model to serve inference.

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.

In the cell below, we use Jinja to create a template for serving.properties. Specifically, we parameterize the model location (option.model_id, or option.s3url in older DJL Serving configurations) so that it can be changed based on where the pretrained model is stored.

[ ]
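
The templating idea can be sketched as follows. The notebook uses Jinja; this sketch swaps in the standard library's `string.Template` so that it runs without extra dependencies. The concrete values (`engine=MPI`, a tensor parallel degree of 8, a batch size of 16, a 3600-second loading timeout) and the helper names are illustrative assumptions, not the notebook's confirmed settings.

```python
from string import Template

# Assumed, illustrative values; adjust them to your model and instance.
SERVING_TEMPLATE = Template(
    "engine=MPI\n"
    "option.model_id=$model_id\n"
    "option.tensor_parallel_degree=$tensor_parallel_degree\n"
    "option.rolling_batch=auto\n"
    "option.max_rolling_batch_size=$max_rolling_batch_size\n"
    "option.model_loading_timeout=3600\n"
)


def render_serving_properties(model_id, tensor_parallel_degree=8, max_rolling_batch_size=16):
    """Fill in the template with the model location and parallelism settings."""
    return SERVING_TEMPLATE.substitute(
        model_id=model_id,
        tensor_parallel_degree=tensor_parallel_degree,
        max_rolling_batch_size=max_rolling_batch_size,
    )


def parse_properties(text):
    """Parse key=value lines back into a dict, for a quick sanity check."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key.strip(): value.strip() for key, value in pairs}
```

Writing the rendered text to a file named serving.properties inside the directory that is later packaged gives DJL Serving its configuration.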

Retrieve the image URI for the DJL container

[ ]
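
A hedged sketch of how the image URI could be looked up with the SageMaker Python SDK. The framework key `djl-deepspeed` is an assumption about which LMI container flavor is intended, and `version` is deliberately left as a required argument because the exact DJL release behind LMI V7 is not stated here.

```python
def djl_image_uri(region, version, framework="djl-deepspeed"):
    """Look up the regional ECR image URI for a DJL LMI container."""
    # Imported lazily so the function can be defined without the SDK installed.
    from sagemaker import image_uris

    return image_uris.retrieve(framework=framework, region=region, version=version)
```

The returned string is what the model creation step expects as its container image.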

Create the tarball and upload it to the S3 location

[ ]
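
Packaging and upload can be sketched with the standard library's `tarfile` module and a boto3 S3 client passed in by the caller. The function names and the directory layout are hypothetical.

```python
import tarfile
from pathlib import Path


def make_model_tarball(source_dir, output_path="model.tar.gz"):
    """Package the directory holding serving.properties as a gzip tarball."""
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # Keep only the directory name inside the archive, not the full path.
        tar.add(str(source), arcname=source.name)
    return output_path


def upload_tarball(s3_client, tarball_path, bucket, key):
    """Upload the tarball with a boto3 S3 client and return its S3 URI."""
    s3_client.upload_file(str(tarball_path), bucket, key)
    return f"s3://{bucket}/{key}"
```

Here `s3_client` would be `boto3.client("s3")`; passing it in keeps the packaging logic testable without AWS access.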

To create the endpoint, follow these steps:

1. Create the model using the container image and the model tarball uploaded earlier.
2. Create the endpoint config using the following key parameters:

a) Instance type is ml.p4d.24xlarge

b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, which gives the container up to an hour to load the model and pass its health check
3. Create the endpoint using the endpoint config created in the previous step.

Create the Model

Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

The container downloads the model into the /tmp space on the container because SageMaker maps the /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB.

For instances like p4d, which come with built-in instance storage volumes, we can continue to use /tmp on the container. This mount is large enough to hold the model.

[ ]
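
A minimal sketch of model creation with the boto3 SageMaker client. The wrapper name `create_lmi_model` is hypothetical; `sm_client` is assumed to be `boto3.client("sagemaker")`.

```python
def create_lmi_model(sm_client, model_name, role_arn, image_uri, model_data_url):
    """Register a SageMaker model from the DJL image and the uploaded tarball."""
    response = sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={
            "Image": image_uri,  # DJL LMI container image URI
            "ModelDataUrl": model_data_url,  # S3 URI of the tarball
        },
    )
    return response["ModelArn"]
```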

This step can take about 7 minutes or longer, so please be patient.

[ ]
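
Endpoint creation could be sketched as below, assuming `sm_client` is a boto3 SageMaker client. The function name, the variant name, and the `-config` suffix are hypothetical; the instance type and the 3600-second startup timeout come from the steps described earlier.

```python
def deploy_endpoint(
    sm_client,
    model_name,
    endpoint_name,
    instance_type="ml.p4d.24xlarge",
    startup_timeout_s=3600,
):
    """Create an endpoint config and endpoint, then block until it is in service."""
    config_name = f"{endpoint_name}-config"  # hypothetical naming scheme
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "variant1",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                # Time allowed for the container to load the model and pass health checks.
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
            }
        ],
    )
    sm_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    # Polls the endpoint status until it reaches InService (or raises on failure).
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return config_name
```

Using the built-in waiter avoids a hand-written polling loop around `describe_endpoint`.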

Leverage Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.

You can pass a prompt as input to the model. This is done by setting inputs to a prompt. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which are passed to the endpoint as a dictionary of kwargs. Refer to this documentation for more details: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.

The code sample below illustrates invoking the endpoint with a text prompt and setting some generation parameters.

[ ]
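
A sketch of the invocation, assuming `smr_client` is `boto3.client("sagemaker-runtime")` and that the container accepts a JSON body with `inputs` and `parameters` keys, as described above. The helper names are hypothetical, and the specific generation parameters you pass (for example `max_new_tokens`) are assumptions that depend on the container.

```python
import json


def build_payload(prompt, **generation_params):
    """Serialize the prompt and generation parameters as a JSON request body."""
    return json.dumps({"inputs": prompt, "parameters": generation_params})


def generate(smr_client, endpoint_name, prompt, **generation_params):
    """Send a prompt to the endpoint and return the raw response text."""
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt, **generation_params),
    )
    # The response body is a stream; read and decode it in one go.
    return response["Body"].read().decode("utf-8")
```

A call might look like `generate(smr_client, endpoint_name, "The future of AI is", max_new_tokens=128)`, with the exact response format determined by the container.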

Conclusion

In this notebook, we demonstrated how to use SageMaker large model inference containers to host Llama2-70B. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

Model parallelism and large model inference on SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html)

Clean Up

[ ]
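
Cleanup can be sketched as three delete calls on the boto3 SageMaker client. The function name is hypothetical; the resource names are whatever was used when the endpoint was created.

```python
def clean_up(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint first, then its config and model, to stop charges."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

Deleting the endpoint is the step that stops instance billing; the config and model are removed afterwards to keep the account tidy.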

Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.