2a Codegen25 FT 7b
CodeGen 2.5 7b
In this notebook we will create and deploy a CodeGen2.5-7b model as an inference component on the endpoint you created in the first notebook. For this model we will use FasterTransformer through the SageMaker Large Model Inference (LMI) container. This is the 2nd notebook in a series of 5 notebooks that deploy models against the endpoint you created in the first notebook. The last notebook shows the other available APIs and cleans up the artifacts created.
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Tested using the Python 3 (Data Science) kernel on SageMaker Studio and conda_python3 kernel on SageMaker Notebook Instance.
Install dependencies
Upgrade the SageMaker Python SDK.
Set configuration
REPLACE the endpoint_name value with the endpoint name created in the first notebook and stored in Jupyter.
We start by creating the objects we will need for this notebook: in particular, the boto3 clients we will use to interact with SageMaker, and other variables that will be referenced later in the notebook.
Preparing model artifacts and uploading them to S3
The LMI container expects a few artifacts to help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages to install
Because CodeGen 2.5 uses the LLaMA architecture, we need to prepare the artifacts properly before they can be used.
The directory structure for the model in S3 MUST look like this to match the model.py code.

Model uploaded to:

fp16: s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/1/1-gpu/

The model prefix must point to s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/

```
model-triton-16B
└── fastertransformer
    ├── config.pbtxt
    └── 1
        └── 1-gpu
            └── model weights
```
We start by making sure the directory we will work in locally is clean for our model artifacts
Prepare LMI container serving.properties file
The LMI container gives you the ability to deploy large models easily. Using the serving.properties file, you can set the options you want for deployment, including the tensor parallel degree as well as other options like data type and quantization strategy. We start by creating this file in our working directory.
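A minimal serving.properties for a FasterTransformer deployment might look like the following. The option names follow the LMI serving.properties conventions, but the exact keys and values (engine name, S3 prefix, tensor parallel degree) are assumptions that depend on your LMI container version and bucket layout:

```
engine=FasterTransformer
option.model_id=s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/
option.tensor_parallel_degree=4
option.dtype=fp16
```

Note that option.model_id must point at the fastertransformer/ prefix shown in the directory structure above.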
We will also need to specify the LMI image we would like to use.
Now that everything is prepared, we create our tarball and upload it to S3 so SageMaker can reference it at deployment time.
Creating an inference component on your endpoint
Inference components can reuse a SageMaker model that you may have already created. You also have the option to specify your artifacts and container directly when creating an inference component, which we will show below. In this example we will also create a SageMaker model in case you want to reference it later.
Create the Inference component
We can now create our inference component. Note below that we specify an inference component name. You can use this name to update your inference component, or to view its metrics and logs in CloudWatch. You will also want to set the "ComputeResourceRequirements", which tell SageMaker how much of each resource to reserve for EACH COPY of your inference component. Finally, we set the number of copies we want to deploy; the number of copies can be managed through autoscaling policies.
Use boto3 to invoke the endpoint.
This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.
You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts; the model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which need to be passed to the endpoint as a dictionary of kwargs. Refer to this documentation - https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig - for more details.
The code sample below illustrates invoking the endpoint with prompts and also sets some generation parameters.
That's it! You have deployed the CodeGen2.5 model on SageMaker as an inference component. You can continue to the other notebooks to deploy more models, or clean up the artifacts created in this example.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.