TGI GPT-NeoX-20B


GPT-NeoX-20B on SageMaker using Hugging Face Text Generation Inference (TGI) DLC

This notebook demonstrates how to deploy the GPT-NeoX-20B language model using the Hugging Face Text Generation Inference (TGI) Deep Learning Container on Amazon SageMaker. GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile using the GPT-NeoX library.

TGI is an open-source, high-performance inference library that can be used to deploy large language models from Hugging Face's repository in minutes. The library includes advanced functionality like model parallelism and dynamic batching to simplify production inference with large language models like flan-t5-xxl, LLaMa, StableLM, and GPT-NeoX.

Setup

Install the SageMaker Python SDK

First, make sure that the latest version of SageMaker SDK is installed.


Setup account and role

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session which we use to determine the current region and execution role.


Retrieve the LLM Image URI

We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference.

The function takes a required parameter, backend, and several optional parameters. The backend specifies the inference backend to use for the model; valid values are "lmi" and "huggingface". "lmi" refers to the SageMaker Large Model Inference (LMI) backend, and "huggingface" refers to the Hugging Face TGI inference backend.


Create the Hugging Face Model

Next, we configure the model object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including HF_MODEL_ID, which identifies the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Here, you should set SM_NUM_GPUS to the number of available GPUs on your selected instance type. For example, in this tutorial, we set SM_NUM_GPUS to 4 because our selected instance type ml.g5.12xlarge has 4 available GPUs.

Additionally, we could reduce the memory footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to a supported quantization method such as bitsandbytes.


Creating a SageMaker Endpoint

Next, we deploy the model by invoking the deploy() function. Here we use an ml.g5.12xlarge instance, which comes with 4 NVIDIA A10G GPUs. By setting the SM_NUM_GPUS environment variable to 4 in the last code block, we indicate that this model should be sharded across all 4 GPU devices.
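The deployment call can be sketched as follows (the startup health-check timeout is an assumption, added because loading 20B parameters can take several minutes):

```python
def deploy_llm(model):
    """Deploy the configured model behind a real-time endpoint.

    ml.g5.12xlarge provides the 4 GPUs that SM_NUM_GPUS=4 shards across.
    """
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",
        container_startup_health_check_timeout=600,  # assumption: allow slow model load
    )

# predictor = deploy_llm(model)  # 'model' from the previous cell
```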


Running Inference

Once the endpoint is up and running, we can evaluate the model using the predict() function.
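For example, a request payload in TGI's generate format (the prompt and the generation parameters are illustrative):

```python
payload = {
    "inputs": "Amazon SageMaker is",  # illustrative prompt
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
        "do_sample": True,
    },
}

# response = predictor.predict(payload)  # 'predictor' from the deploy cell
# print(response[0]["generated_text"])
```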


Create sample chatbot application backed by SageMaker (Optional)

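The sample application itself is not shown here; as a hypothetical sketch, a minimal chat UI could wrap the endpoint like this (Gradio is an assumption and would need to be installed separately):

```python
def chat_fn(message, history):
    # In the real app, forward the message to the SageMaker endpoint, e.g.:
    #   predictor.predict({"inputs": message})[0]["generated_text"]
    return "(model reply placeholder)"

# import gradio as gr                           # assumption: pip install gradio
# gr.ChatInterface(chat_fn).launch(share=True)  # launch() prints the app URL
```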

Once the sample application is running, we can open the URL it prints to perform text generation with the model.

Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.
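A sketch of the cleanup cell:

```python
def cleanup(predictor):
    """Delete the model and endpoint so the instance stops incurring charges."""
    predictor.delete_model()
    predictor.delete_endpoint()

# cleanup(predictor)  # 'predictor' from the deploy cell
```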


Conclusion

In this tutorial, we used a TGI container to deploy GPT-NeoX-20B using 4 GPUs on a SageMaker ml.g5.12xlarge instance. With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like GPT-NeoX, flan-t5-xxl, and LLaMa.