Llama2 7b Batching Throughput

Increase throughput for Llama2-7b Model using Batching techniques on SageMaker LMI v5


This notebook was CI-tested in us-west-2; CI test results for other regions can be found at the end of the notebook.


In this notebook, we explore how to use different batching techniques to increase throughput for the Llama2-7b large language model on SageMaker using the LMI v5 container. We use DJLServing as the model serving solution in this example; it is bundled in the LMI container. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to https://docs.djl.ai/docs/serving/index.html.

Batching increases throughput for generative AI inference by combining requests and sending them to the LLM together as a batch. In this notebook we explore three batching techniques, namely Dynamic Batching, Continuous Batching, and Paged Attention Batching, and demonstrate the throughput gains each achieves.

We utilize the SageMaker LMI v5 container, which provides rolling batch capability for Continuous Batching along with Paged Attention. In this notebook, we deploy the https://huggingface.co/TheBloke/Llama-2-7B-fp16 model across the GPUs of an ml.g5.12xlarge instance.

Import required libraries and establish session using SageMaker SDK

[ ]
[ ]
[ ]
[ ]

[OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

If you intend to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below; otherwise, skip to the next step.

[ ]
[ ]
[ ]

Define a variable to hold the S3 URL of the location that contains the model

[ ]

Deploy 3 endpoints for benchmarking with settings to enable different Batching techniques

We will deploy 3 different endpoints for benchmarking, each configured for a different batching technique:

  • Dynamic Batching
  • Continuous Batching
  • Paged Attention Batching

1. Dynamic Batching

1.1 Create serving.properties for Dynamic Batching

This configuration file tells DJL Serving which model parallelization and inference optimization libraries to use. Depending on your needs, you can set the appropriate configuration.

Here is a list of the settings we use in this configuration file:

engine: The engine for DJL to use. In this case, we set it to Python (Dynamic Batching) or MPI (Continuous and Paged Attention Batching).
option.model_id: The model ID of a pretrained model hosted in a model repository on huggingface.co (https://huggingface.co/models), or the S3 path to the model artifacts.
option.tensor_parallel_degree: The number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that are started up when DJL Serving runs. For example, on an 8-GPU machine creating 8 partitions, we will have 1 worker per model to serve the requests.

For more details and an exhaustive list of configuration options, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.

[ ]
[ ]
[ ]
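A serving.properties for Dynamic Batching looks roughly like the snippet below. This is a sketch, not the notebook's exact file: the batch_size and max_batch_delay values are illustrative assumptions, and tensor_parallel_degree=4 is chosen to match the 4 GPUs of an ml.g5.12xlarge.

```
engine=Python
option.model_id=TheBloke/Llama-2-7B-fp16
option.tensor_parallel_degree=4
option.dtype=fp16
batch_size=4
max_batch_delay=100
```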

The image URI for the DJL container is retrieved here

[ ]

Create the tarball and then upload it to an S3 location

[ ]
[ ]
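The packaging step can be sketched as below. The directory name mymodel and the S3 prefix are assumptions, and the upload itself (left as a comment) would use sagemaker.s3.S3Uploader and requires AWS credentials:

```python
import os
import tarfile

# Stage the configuration into a local directory (name is illustrative)
os.makedirs("mymodel", exist_ok=True)
with open("mymodel/serving.properties", "w") as f:
    f.write("engine=Python\noption.model_id=TheBloke/Llama-2-7B-fp16\n")

# Package serving.properties at the root of the tarball, where DJL Serving looks for it
with tarfile.open("mymodel.tar.gz", "w:gz") as tar:
    tar.add("mymodel/serving.properties", arcname="serving.properties")

# Upload to S3 (requires AWS credentials; bucket/prefix are placeholders):
# from sagemaker.s3 import S3Uploader
# code_artifact = S3Uploader.upload("mymodel.tar.gz", f"s3://{bucket}/llama2-7b/code")
```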

1.2 Deploy endpoint for Dynamic Batching

[ ]
[ ]
[ ]
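A deployment cell typically chains three boto3 calls: create_model, create_endpoint_config, and create_endpoint. The helper below is a hedged sketch (the function and variable names are ours, not the notebook's); it takes the SageMaker client as a parameter so the flow can be exercised without a real account:

```python
def deploy_llm_endpoint(sm_client, name, image_uri, model_data_url, role_arn,
                        instance_type="ml.g5.12xlarge"):
    """Create a model, an endpoint config, and an endpoint with one variant."""
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    return sm_client.create_endpoint(EndpointName=name, EndpointConfigName=name)

# Usage (requires AWS credentials):
# import boto3
# deploy_llm_endpoint(boto3.client("sagemaker"), "llama2-7b-dynamic",
#                     inference_image_uri, code_artifact, role)
```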

Wait for the endpoint to be In-Service. This can take a while, so please be patient.

[ ]
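The wait step can be sketched as a simple polling loop; a boto3 waiter (sm_client.get_waiter("endpoint_in_service")) would also work. The names here are illustrative:

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30):
    """Poll DescribeEndpoint until the endpoint leaves the 'Creating' state."""
    while True:
        status = sm_client.describe_endpoint(
            EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status  # e.g. "InService" or "Failed"
        time.sleep(poll_seconds)

# Usage (requires AWS credentials):
# status = wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```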

2. Continuous Batching

2.1 Create serving.properties for Continuous Batching

[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
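For Continuous Batching, the engine switches to MPI and a rolling batch handler is enabled. The snippet below is a sketch based on the LMI configuration documentation rather than the notebook's exact file; the handler name and max_rolling_batch_size value are assumptions:

```
engine=MPI
option.model_id=TheBloke/Llama-2-7B-fp16
option.tensor_parallel_degree=4
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
option.paged_attention=false
```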

2.2 Deploy endpoint for Continuous Batching

[ ]
[ ]
[ ]

This can take a while, so please be patient

[ ]

3. Paged Attention Batching

3.1 Create serving.properties for Paged Attention Batching

[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
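Paged Attention Batching uses the same rolling batch setup with paged attention turned on. Again, this is a hedged sketch rather than the notebook's exact file:

```
engine=MPI
option.model_id=TheBloke/Llama-2-7B-fp16
option.tensor_parallel_degree=4
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
option.paged_attention=true
```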

3.2 Deploy endpoint for Paged Attention Batching

[ ]
[ ]
[ ]

This can take a while, so please be patient

[ ]

Benchmark your model

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results. We use the awscurl command line tool to try it (the awscurl command line tool requires Java).

We pass multiple prompts as input to the model. This is done by downloading the benchmarking tool below and setting up the desired performance test environment for running the tests.

The following steps need to be run in a Studio terminal, an EC2 instance, or a notebook instance. In the example below we use a concurrency of 50 clients sending 100 requests from a g5.12xlarge Studio notebook terminal; if you want to benchmark at lower or higher concurrency, feel free to use a correspondingly smaller or larger compute instance to run the concurrent benchmark tests.

Install Java using the shell

[ ]

Download the benchmarking tool

[ ]

Create prompts to be used with Dynamic Batching

[ ]
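A prompt file for the load test can be generated as below. The file name prompts.txt and the payload schema (the inputs/parameters shape commonly used by LMI text-generation endpoints) are assumptions for illustration:

```python
import json

# A few example prompts; the load tool replays these payloads against the endpoint
prompts = [
    "The diamondback terrapin was the first reptile to",
    "Amazon SageMaker is a service that",
]

with open("prompts.txt", "w") as f:
    for text in prompts:
        payload = {
            "inputs": text,
            "parameters": {"do_sample": True, "max_new_tokens": 128},
        }
        f.write(json.dumps(payload) + "\n")
```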

Set up credentials using env vars or use .aws/credentials file

[ ]

Dynamic Batching Benchmarking

We run the benchmarking tool to get the results for Dynamic Batching. You can change the concurrency through -c and the number of requests through -N; however, please ensure that you are using an instance with enough compute to run the test and that the endpoint is deployed on an instance capable of handling the concurrency. Run awscurl -h for more help on the benchmark tool.

[ ]

After the benchmark tests complete, we see results in the format below. These values are samples; actual results will vary based on the environment setup.

  • Total time: 25073.52 ms.
  • Non 200 responses: 0, error rate: 0.00
  • Concurrent clients: 2
  • Total requests: 4
  • TPS: 0.16/s
  • Total token: 512
  • token/req: 128
  • token/s: 20.42/s
  • Average Latency: 12359.70 ms.
  • P50: 14058.52 ms.
  • P90: 14059.23 ms.
  • P99: 14059.23 ms.

Please make a note of the actual values for later comparison
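To make the sample output above concrete, the headline metrics can be recomputed from the raw numbers (total time, request count, and token count):

```python
# Sample values from the output above
total_time_s = 25073.52 / 1000   # total time in seconds
total_requests = 4
total_tokens = 512

tps = total_requests / total_time_s           # transactions per second
tokens_per_req = total_tokens / total_requests
tokens_per_s = total_tokens / total_time_s    # generation throughput

print(f"TPS: {tps:.2f}/s")                # 0.16/s
print(f"token/req: {tokens_per_req:.0f}") # 128
print(f"token/s: {tokens_per_s:.2f}/s")   # 20.42/s
```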

Continuous Batching Benchmarking

We run the benchmarking tool to get the results for Continuous Batching.

[ ]
[ ]

Please make a note of actual values from the benchmark tests for later comparison

Benchmarking throughput for Paged Attention Batching

We run the benchmarking tool to get the results for Paged Attention Batching.

[ ]

Please make a note of actual values from the benchmark tests for later comparison

We can now compare the throughput results (e.g. token/s, TPS) for the different batching techniques to review the throughput gains achieved by using Continuous and Paged Attention Batching over Dynamic Batching.

Model      | Batching strategy       | TPS  | Token/s
llama2-7b  | Dynamic Batching        | 2.62 | 336.28
llama2-7b  | Continuous Batching     | 6.14 | 849.48
llama2-7b  | PagedAttention Batching | 6.29 | 889.68
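Using the sample numbers above, the relative gains work out roughly as follows (your own measurements will differ):

```python
# token/s throughput for each strategy, from the sample results above
token_s = {
    "Dynamic Batching": 336.28,
    "Continuous Batching": 849.48,
    "PagedAttention Batching": 889.68,
}

baseline = token_s["Dynamic Batching"]
for strategy, value in token_s.items():
    print(f"{strategy}: {value / baseline:.2f}x token/s vs Dynamic Batching")
# In this sample run, Continuous Batching is ~2.53x and
# PagedAttention Batching ~2.65x the Dynamic Batching throughput.
```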

Clean Up

Delete the resources (endpoint, endpoint config, model) deployed for the 3 endpoints used in the tests above.

Notebook CI Test Results

This notebook was tested in multiple regions: us-east-1, us-east-2, us-west-1, ca-central-1, sa-east-1, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, and ap-south-1, in addition to us-west-2 (shown at the top of the notebook).