OpenLLaMA 7B
OpenLLaMA 7B implementation using the LMI container on SageMaker
- Model source: https://github.com/openlm-research/open_llama
- Model download hub: https://huggingface.co/openlm-research/open_llama_7b
- License: Apache-2.0
In this tutorial, you will bring your own container from Docker Hub to SageMaker and run inference with it. Please make sure the following permissions are granted before running the notebook:
- ECR Push/Pull access
- S3 bucket push access
- SageMaker access
Attribution: this notebook is based on the content of https://github.com/deepjavalibrary/djl-demo/tree/master and was debugged with the help of lanking520.
Step 1: Let's bump up SageMaker and import stuff
[1]
Note: you may need to restart the kernel to use updated packages.
[ ]
[3]
[4]
arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032 us-west-2 328296961357
[5]
'2.161.0'
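The setup cells above were dropped from this export; the usual boilerplate (a sketch, assuming a SageMaker notebook environment) looks like:

```python
# %pip install sagemaker --upgrade   # then restart the kernel

import boto3
import sagemaker

role = sagemaker.get_execution_role()          # execution-role ARN printed above
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
print(role, region, account_id)
print(sagemaker.__version__)                   # '2.161.0' in this run
```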
Step 2: Pull and push the Docker image from Docker Hub to the ECR repository (optional)
Note: you can either use a prebuilt container or use the cell below (change the cell type from 'raw' to 'code').
Note: please make sure your AWS credentials have permission to push to the ECR repository.
This process may take a while, depending on the container size and your network bandwidth.
Note: you only need to build this container once. Once you have pushed it to ECR, you can pull the image via
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
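The raw cell itself was dropped in this export; a hedged shell sketch of the usual pull/tag/push sequence (the account id, repo name, and tag below are placeholders, not the notebook's exact values):

```shell
# Hypothetical sketch: pull the DJL serving image from Docker Hub, retag it,
# and push it to your own ECR repository. Substitute your own values.
account_id=123456789012
region=us-west-2
repo_name=djl-serving
tag=0.23.0-deepspeed
image_uri="${account_id}.dkr.ecr.${region}.amazonaws.com/${repo_name}:${tag}"

# Set RUN_PUSH=1 to actually execute (requires Docker and AWS credentials).
if [ "${RUN_PUSH:-0}" = "1" ]; then
  aws ecr create-repository --repository-name "$repo_name" --region "$region" || true
  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "${account_id}.dkr.ecr.${region}.amazonaws.com"
  docker pull "deepjavalibrary/djl-serving:${tag}"
  docker tag "deepjavalibrary/djl-serving:${tag}" "$image_uri"
  docker push "$image_uri"
fi
echo "$image_uri"
```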
Step 3: Start preparing model artifacts
The LMI container expects a few artifacts that set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file defining the core inference logic
- requirements.txt (optional): any additional pip packages that need to be installed
[7]
Writing serving.properties
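The cell above writes serving.properties, but its contents were dropped from this export. A plausible minimal configuration for OpenLLaMA 7B (the engine and option values are assumptions, not the notebook's exact file):

```properties
engine=DeepSpeed
option.model_id=openlm-research/open_llama_7b
option.tensor_parallel_degree=1
option.dtype=fp16
```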
[8]
Writing model.py
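Similarly, the model.py body is missing from this export. A hedged sketch following the djl_python `handle()` convention (the model-loading and generation details are assumptions, not the notebook's exact code):

```python
# model.py -- hypothetical sketch of an LMI entrypoint.
import torch
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None


def load_model(properties):
    model_id = properties.get("model_id", "openlm-research/open_llama_7b")
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    return mdl, tok


def handle(inputs: Input) -> Output:
    global model, tokenizer
    if model is None:
        model, tokenizer = load_model(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request from the model server
    data = inputs.get_as_json()
    prompt = data["inputs"]
    params = data.get("parameters", {})
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=params.get("max_new_tokens", 64))
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return Output().add_as_json({"generated_text": text})
```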
[9]
Writing requirements.txt
[10]
mymodel/ mymodel/requirements.txt mymodel/model.py mymodel/serving.properties
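The cell above packages the mymodel/ directory (equivalent to `tar czvf mymodel.tar.gz mymodel/`). A stdlib sketch of the same packaging step:

```python
import tarfile


def package_model(src_dir="mymodel", out="mymodel.tar.gz"):
    """Tar-gzip the artifact directory and return the member names."""
    with tarfile.open(out, "w:gz") as tar:
        tar.add(src_dir)
    with tarfile.open(out) as tar:
        return tar.getnames()
```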
Step 4: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.
4.1 Upload artifact on S3 and create SageMaker model
[12]
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-328296961357/large-model-lmi/code/mymodel.tar.gz 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118
4.2 Create SageMaker endpoint
You need to specify the instance type to use and the endpoint name.
[13]
--------------!
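The deployment cell itself is elided in this export; a hedged sketch with the SageMaker Python SDK (`image_uri`, `code_artifact`, `role`, and `endpoint_name` come from the earlier cells; the instance type and timeout are assumptions):

```python
from sagemaker import Model

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",               # assumed; pick a GPU instance that fits 7B
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=600,  # large models take a while to load
)
```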
Step 5a: Test and benchmark inference latency
The latency depends heavily on the 'max_new_tokens' parameter.
[14]
2.2340340614318848
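The 2.23 s figure above is wall-clock time for a single invocation; a minimal sketch of that timing pattern (in the notebook the callable would wrap `invoke_endpoint`, which is stubbed out here):

```python
import time


def timed(invoke):
    """Run a zero-argument callable and return (latency_seconds, result)."""
    start = time.perf_counter()
    result = invoke()
    return time.perf_counter() - start, result


# stand-in for: timed(lambda: smr_client.invoke_endpoint(...))
latency, _ = timed(lambda: "stub response")
```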
Let us define a helper function to get a histogram of invocation latency distribution
[15]
Matplotlib is building the font cache; this may take a moment.
[16]
100%|██████████| 10/10 [01:53<00:00, 11.35s/it]
114.2704861164093 CPU times: user 258 ms, sys: 39.5 ms, total: 298 ms Wall time: 1min 54s
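The notebook's helper plots the distribution with matplotlib; a stdlib-only sketch of the binning it relies on (the bin count is arbitrary):

```python
def histogram(samples, bins=10):
    """Bin latency samples into equal-width buckets and return the counts."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0   # avoid zero width when all samples are equal
    counts = [0] * bins
    for s in samples:
        i = min(int((s - lo) / width), bins - 1)  # clamp the max sample into the last bin
        counts[i] += 1
    return counts
```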
[17]
open-llama-lmi-model-2023-06-02-00-16-24-723 us-west-2
Step 5b: Analyze Inference Latency via CloudWatch
[18]
[19]
[20]
2023-06-02 00:26:07.841647 2023-06-02 00:23:13.571161
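A hedged sketch of the CloudWatch query the cells above would issue: build a GetMetricStatistics request for the endpoint's ModelLatency metric. The namespace, metric, and dimension names are the standard SageMaker ones; the time window and statistics are assumptions.

```python
from datetime import datetime, timedelta


def latency_query(endpoint_name, minutes=15):
    """Build kwargs for cloudwatch.get_metric_statistics on ModelLatency."""
    end = datetime.utcnow()
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Average", "Maximum"],
    }


# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**latency_query(endpoint_name))
```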
[21]
[22]
[23]
Clean up the environment
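A hedged cleanup sketch with boto3; the endpoint, endpoint-config, and model names below are placeholders for the names created earlier in the notebook.

```python
import boto3

# Placeholders -- substitute the names created earlier in the notebook.
endpoint_name = "my-endpoint"
model_name = "my-model"

sm = boto3.client("sagemaker")
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm.delete_model(ModelName=model_name)
```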
[ ]