OpenLLaMA 7B
OpenLLaMA 7B implementation using the LMI container on SageMaker
- Model source: https://github.com/openlm-research/open_llama
- Model download hub: https://huggingface.co/openlm-research/open_llama_7b
- License: Apache-2.0
In this tutorial, you will bring your own container from Docker Hub to SageMaker and run inference with it. Please make sure the following permissions are granted before running the notebook:
- ECR Push/Pull access
- S3 bucket push access
- SageMaker access
Attribution: this notebook is based on the content of https://github.com/deepjavalibrary/djl-demo/tree/master and was debugged with the help of lanking520.
Step 1: Let's bump up SageMaker and import stuff
[1]
Note: you may need to restart the kernel to use updated packages.
[ ]
[3]
[4]
arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032 us-west-2 328296961357
[5]
'2.161.0'
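The setup cells above were dropped from this export; the usual boilerplate (a sketch, assuming a SageMaker notebook environment) looks like:

```python
# %pip install sagemaker --upgrade   # then restart the kernel

import boto3
import sagemaker

role = sagemaker.get_execution_role()          # execution-role ARN printed above
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
print(role, region, account_id)
print(sagemaker.__version__)                   # '2.161.0' in this run
```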
Step 2: Pull and push the Docker image from Docker Hub to the ECR repository (optional)
Note: you can either use a prebuilt container or use the cell below (change the cell type from 'raw' to 'code').
Note: please make sure your AWS credentials have permission to push to the ECR repository.
This process may take a while, depending on the container size and your network bandwidth.
Note: you only need to build this container once. Once you have pushed it to ECR, you can pull the image via
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
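The raw cell itself was dropped in this export; a hedged shell sketch of the usual pull/tag/push sequence (the account id, repo name, and tag below are placeholders, not the notebook's exact values):

```shell
# Hypothetical sketch: pull the DJL serving image from Docker Hub, retag it,
# and push it to your own ECR repository. Substitute your own values.
account_id=123456789012
region=us-west-2
repo_name=djl-serving
tag=0.23.0-deepspeed
image_uri="${account_id}.dkr.ecr.${region}.amazonaws.com/${repo_name}:${tag}"

# Set RUN_PUSH=1 to actually execute (requires Docker and AWS credentials).
if [ "${RUN_PUSH:-0}" = "1" ]; then
  aws ecr create-repository --repository-name "$repo_name" --region "$region" || true
  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "${account_id}.dkr.ecr.${region}.amazonaws.com"
  docker pull "deepjavalibrary/djl-serving:${tag}"
  docker tag "deepjavalibrary/djl-serving:${tag}" "$image_uri"
  docker push "$image_uri"
fi
echo "$image_uri"
```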
Step 3: Start preparing model artifacts
The LMI container expects a few artifacts that set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file defining the core inference logic
- requirements.txt (optional): any additional pip packages that need to be installed
[7]
Writing serving.properties
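The cell above writes serving.properties, but its contents were dropped from this export. A plausible minimal configuration for OpenLLaMA 7B (the engine and option values are assumptions, not the notebook's exact file):

```properties
engine=DeepSpeed
option.model_id=openlm-research/open_llama_7b
option.tensor_parallel_degree=1
option.dtype=fp16
```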
[8]
Writing model.py
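Similarly, the model.py body is missing from this export. A hedged sketch following the djl_python `handle()` convention (the model-loading and generation details are assumptions, not the notebook's exact code):

```python
# model.py -- hypothetical sketch of an LMI entrypoint.
import torch
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None


def load_model(properties):
    model_id = properties.get("model_id", "openlm-research/open_llama_7b")
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    return mdl, tok


def handle(inputs: Input) -> Output:
    global model, tokenizer
    if model is None:
        model, tokenizer = load_model(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request from the model server
    data = inputs.get_as_json()
    prompt = data["inputs"]
    params = data.get("parameters", {})
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=params.get("max_new_tokens", 64))
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return Output().add_as_json({"generated_text": text})
```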
[9]
Writing requirements.txt
[10]
mymodel/ mymodel/requirements.txt mymodel/model.py mymodel/serving.properties
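The cell above packages the mymodel/ directory (equivalent to `tar czvf mymodel.tar.gz mymodel/`). A stdlib sketch of the same packaging step:

```python
import tarfile


def package_model(src_dir="mymodel", out="mymodel.tar.gz"):
    """Tar-gzip the artifact directory and return the member names."""
    with tarfile.open(out, "w:gz") as tar:
        tar.add(src_dir)
    with tarfile.open(out) as tar:
        return tar.getnames()
```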
Step 4: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.
4.1 Upload artifact on S3 and create SageMaker model
[12]
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-328296961357/large-model-lmi/code/mymodel.tar.gz 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118
4.2 Create SageMaker endpoint
You need to specify the instance type to use and the endpoint name.
[13]
--------------!
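The deployment cell itself is elided in this export; a hedged sketch with the SageMaker Python SDK (`image_uri`, `code_artifact`, `role`, and `endpoint_name` come from the earlier cells; the instance type and timeout are assumptions):

```python
from sagemaker import Model

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",               # assumed; pick a GPU instance that fits 7B
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=600,  # large models take a while to load
)
```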
Step 5a: Test and benchmark inference latency
The latency depends heavily on the 'max_new_tokens' parameter.
[14]
2.2340340614318848
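The 2.23 s figure above is wall-clock time for a single invocation; a minimal sketch of that timing pattern (in the notebook the callable would wrap `invoke_endpoint`, which is stubbed out here):

```python
import time


def timed(invoke):
    """Run a zero-argument callable and return (latency_seconds, result)."""
    start = time.perf_counter()
    result = invoke()
    return time.perf_counter() - start, result


# stand-in for: timed(lambda: smr_client.invoke_endpoint(...))
latency, _ = timed(lambda: "stub response")
```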
Let us define a helper function to get a histogram of invocation latency distribution
[15]
Matplotlib is building the font cache; this may take a moment.
[16]
100%|██████████| 10/10 [01:53<00:00, 11.35s/it]
114.2704861164093 CPU times: user 258 ms, sys: 39.5 ms, total: 298 ms Wall time: 1min 54s
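The notebook's helper plots the distribution with matplotlib; a stdlib-only sketch of the binning it relies on (the bin count is arbitrary):

```python
def histogram(samples, bins=10):
    """Bin latency samples into equal-width buckets and return the counts."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0   # avoid zero width when all samples are equal
    counts = [0] * bins
    for s in samples:
        i = min(int((s - lo) / width), bins - 1)  # clamp the max sample into the last bin
        counts[i] += 1
    return counts
```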
[17]
open-llama-lmi-model-2023-06-02-00-16-24-723 us-west-2
Step 5b: Analyze Inference Latency via CloudWatch
[18]
[19]
[20]
2023-06-02 00:26:07.841647 2023-06-02 00:23:13.571161
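A hedged sketch of the CloudWatch query the cells above would issue: build a GetMetricStatistics request for the endpoint's ModelLatency metric. The namespace, metric, and dimension names are the standard SageMaker ones; the time window and statistics are assumptions.

```python
from datetime import datetime, timedelta


def latency_query(endpoint_name, minutes=15):
    """Build kwargs for cloudwatch.get_metric_statistics on ModelLatency."""
    end = datetime.utcnow()
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Average", "Maximum"],
    }


# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**latency_query(endpoint_name))
```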
[21]
[22]
[23]
Clean up the environment
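A hedged cleanup sketch with boto3; the endpoint, endpoint-config, and model names below are placeholders for the names created earlier in the notebook.

```python
import boto3

# Placeholders -- substitute the names created earlier in the notebook.
endpoint_name = "my-endpoint"
model_name = "my-model"

sm = boto3.client("sagemaker")
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm.delete_model(ModelName=model_name)
```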
[ ]