2a Codegen25 FT 7b
CodeGen 2.5 7b
In this notebook we will create and deploy a CodeGen2.5-7b model as an inference component on the endpoint you created in the first notebook. For this model we will use FasterTransformer through the SageMaker Large Model Inference (LMI) container. This is the 2nd notebook in a series of 5 notebooks that deploy models against the endpoint you created in the first notebook. The last notebook shows the other available APIs and cleans up the artifacts created.
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Tested using the Python 3 (Data Science) kernel on SageMaker Studio and conda_python3 kernel on SageMaker Notebook Instance.
Install dependencies
Upgrade the SageMaker Python SDK.
Set configuration
REPLACE the endpoint_name value with the endpoint name created in the first notebook and stored in Jupyter.
We start by creating the objects we will need for this notebook: in particular, the boto3 clients we will use to interact with SageMaker, and other variables that will be referenced later in the notebook.
Preparing model artifacts and uploading them to S3
The LMI container expects a few artifacts to help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages to install
Because CodeGen 2.5 uses the LLaMA architecture, we need to prepare the artifacts properly before they can be used.
The directory structure for the model in S3 MUST look like this to match the model.py code.

Model uploaded to:

fp16: s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/1/1-gpu/

The model prefix must point to s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/

```
model-triton-16B
└── fastertransformer
    ├── config.pbtxt
    └── 1
        └── 1-gpu
            └── model weights
```
We start by making sure the directory we will work in locally is clean for our model artifacts
Prepare LMI container serving.properties file
The LMI container gives you the ability to deploy large models easily. Using the serving.properties file, you can set the options you want for deployment, including the tensor parallel degree as well as other options like data type and quantization strategy. We start by creating this file in our working directory.
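A minimal serving.properties for a FasterTransformer deployment might look like the following. The option names follow the LMI serving.properties conventions, but the exact keys and values (engine name, S3 prefix, tensor parallel degree) are assumptions that depend on your LMI container version and bucket layout:

```
engine=FasterTransformer
option.model_id=s3://bucket/hf-large-model-djl-/codgen25-7b/fastertransformer/
option.tensor_parallel_degree=4
option.dtype=fp16
```

Note that option.model_id must point at the fastertransformer/ prefix shown in the directory structure above.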
We will also need to specify the LMI image we would like to use.
Now that everything is prepared, we create our tarball and upload it to S3 so SageMaker can reference it at deployment time.
Creating an inference component on your endpoint
Inference components can reuse a SageMaker model that you may have already created. You also have the option to specify your artifacts and container directly when creating an inference component, which we will show below. In this example we will also create a SageMaker model in case you want to reference it later.
Create the Inference component
We can now create our inference component. Note below that we specify an inference component name. You can use this name to update your inference component, or to view its metrics and logs in CloudWatch. You will also want to set the "ComputeResourceRequirements", which tell SageMaker how much of each resource to reserve for EACH COPY of your inference component. Finally, we set the number of copies we want to deploy; the number of copies can be managed through autoscaling policies.
Use boto3 to invoke the endpoint.
This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.
You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts; the model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which need to be passed to the endpoint as a dictionary of kwargs. Refer to this documentation - https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig - for more details.
The code sample below illustrates invoking the endpoint with prompts and also sets some generation parameters.
That's it! You have deployed the CodeGen2.5 model on SageMaker as an inference component. You can continue to the other notebooks to deploy more models, or clean up the artifacts created in this example.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.