Deploy Openchatkit On Sagemaker
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Deploy OpenChatKit Model with high performance on SageMaker
In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages some of the most popular open source libraries for model parallel inference like DeepSpeed and Hugging Face Accelerate. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).
Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from models zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.
Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
SageMaker has rolled out DeepSpeed and Accelerate container which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.
In this notebook, we deploy the open source GPT-NeoXT-Chat-Base-20B (OpenChatKit) model across GPUs on a ml.g5.12xlarge instance. The open source GPT-JT-Moderation-6b model is deployed across GPUs in the same instance
OpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai. Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions. You can read more information on OpenChatKit here
In this example, we demonstrate how to use SageMaker large model inference container to host OpenChatKit. We used HuggingFace Accelerate's model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. OpenChatKit also includes an extensible retrieval system. With the retrieval system the chatbot is able to incorporate regularly updated or custom content, such as knowledge from Wikipedia, news feeds, or sports scores in response. The additional component of OpenChatKit is a 6 billion parameter moderation model fine-tuned from GPT-JT. In chat applications, the moderation model runs in tandem with the main chat model, checking the user utterance for any inappropriate content. Based on the moderation model’s assessment, the chatbot can limit the input to moderated subjects. For more narrow tasks the moderation model can be used to detect out-of-domain questions and override when the question is not on topic Please refer to this blog post to extend this model with retrieval system.
Invocations to SageMaker endpoints are stateless, so a model cannot automatically refer to past messages in computing new outputs. As a result, a DynamoDB table is created to store conversations based on a unique identifier generated by the endpoint. When this identifier is passed in with the invocation request, the model concatenates the new prompt with the previous conversation before performing inference.
As a result, the IAM role used for the endpoint needs permissions for the following actions:
dynamodb:CreateTabledynamodb:DescribeTabledynamodb:PutItemdynamodb:GetItem
HuggingFace Accelerate is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. For further reading on HuggingFace you can refer to https://huggingface.co/docs
Licence agreement
- View license information https://github.com/togethercomputer/OpenChatKit/blob/main/LICENSE before using the model.
- This notebook is a sample notebook and not intended for production use. Please refer to the licence at https://github.com/aws/mit-0.
- Faiss is available from https://github.com/facebookresearch/faiss. View license information at https://github.com/facebookresearch/faiss/blob/main/LICENSE
Download the models from Hugging Face and upload the model artifacts on Amazon S3
Create SageMaker compatible Model artifact, upload Model to S3 and bring your own inference script.
SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions.
However, in this notebook, we demonstrate how to deploy a model with custom inference code.
SageMaker needs the model artifacts to be in a Tarball format. In this example, we provide the following files - serving.properties and model.py.
The tarball is in the following format
code
├────
│ └── serving.properties
│ └── model.py
serving.propertiesis the configuration file that can be used to configure the model server.model.pyis the script handles any requests for serving.
Create serving.properties
This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.
Here is a list of settings that we use in this configuration file -
engine: The engine for DJL to use. In this case, it is Python.option.entryPoint: The entry point python file or module. This should align with the engine that is being used.option.s3url: Set this to the URI of the Amazon S3 bucket that contains the model.
If you want to download the model from huggingface.co, you can set option.modelid. The model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model ID to download the corresponding model repository on huggingface.co.
option.tensor_parallel_degree: Set to the number of GPU devices over which HuggingFace Accelerate needs to partition the model. This parameter also controls the number of workers per model which will be started up when DJL serving runs. As an example if we have an 8 GPU machine, and we are creating 8 partitions then we will have 1 worker per model to serve the requests.
For more details on the configuration options and an exhaustive list, you can refer the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
HuggingFace Accelerate can automatically handle the device map computation by setting the device_map option to a supported option, or a device map can be provided. By using the auto device map, HuggingFace evenly splits the model across all available GPUs by maximising the available GPU RAM
The below code implements the handling logic for the main OpenChatKit GPT-NeoX model. The overall solution is implemented over 4 files to handle:
- Receiving inference request and handling it (
model.py) - Downloading and preparing the Wikipedia index (
wikipedia_prepare.py) - Searching the Wikipedia Index for relevant documents (
wikipedia.py) - Storing and retrieving the conversation thread in DynamoDB for passing to the model and user (
conversation.py)
model.py implements a class OpenChatKitService which handles passing the data between the GPT-JT Moderation mode, GPT NeoX model, Faiss search, and the conversation object. This is called on when inference is performed. This will also generate a unique ID for each invocation if one is not supplied for the purpose of storing the prompts in DynamoDB.
The ChatModel class loads the model and generates the response. A stopping criteria is configured for the generation to only produce the bot response on inference. This handles partitioning the model across multiple GPUs.
The ModerationModel class will load the model and generate the classification for moderation. If it finds that the classification is "needs intervention", the return value will be True to advise the model to censor the response to the user.
conversation.py is adapted from the open source OpenChatKit repository. This file is responsible for defining the object that stores the conversation turns between the human and the model. With this, the model is able to retain a session for the conversation allowing a user to refer to previous messages.
As SageMaker endpoint invocations are stateless, this conversation needs to be stored in a location external to the endpoint instances. On startup, the instance will create a DynamoDB table if it does not exist. All updates to the conversation are then stored in DynamoDB based on the session_id key which is generated by the endpoint. Any invocation with a session ID will retrieve the associated conversation string and update it as required.
In order to search the Wikipedia documents for relevant text, the index needs to be downloaded from HuggingFace as it is not packaged elsewhere.
This file is responsible for handling the download when imported. Only a single process in the multiple that are running for inference can clone the repository. The rest will instead wait until the files are present in the local filesystem.
This code is responsible for loading and searching the Wikipedia document index. This helps to provide additional context to the chatbot which can improve performance.
One of the other features of OpenChatKit are the moderation capabilities. While the model itself does have some moderation built in, TogetherComputer trained a GPT-JT-Moderation-6B model with Ontocord.ai's OIG-moderation dataset. This model runs alongside the main chatbot to check both the user input and answer from the bot do not contain inappropriate results. In the scenario they do, the input model will indicate to the chat model that the input is inappropriate to override the inference result, and the output model will override the inference result.
The input moderation model returns the data in a format that is readable by the bot as if it were a regular input. The output moderation model does not include this change.
Image URI for the DJL container is being used here
The index search uses Facebook's Faiss library for performing the similarity search. As this is not included in the base LMI image, the container needs to be adapted to install this library. The below defines a Dockerfile which installs Faiss from source alongside other libraries needed by the bot endpoint.
This uses the SageMaker Studio Image Build CLI to build the Docker image defined above as SageMaker Studio does not allow for Docker to be installed for building the image. This will leverage CodeBuild to remotely build the image and push it to a private ECR repository.
This same Dockerfile can be built anywhere that allows for running Docker commands and pushing to a relevant ECR repository.
Create the Tarball and then upload to S3 location
To create the endpoint the steps are:
-
Build an image adapted from the DJL container that installs Faiss for information retrieval
-
Create the Model using the Image container and the Model Tarball uploaded earlier
-
Create the endpoint config using the following key parameters
a) Instance Type is ml.g5.12xlarge
b) ContainerStartupHealthCheckTimeoutInSeconds is 2400 to ensure health check starts after the model is ready
-
Create the end point using the endpoint config created
Create the Model
Use the image URI built from the DJL container and the s3 location to which the tarball was uploaded. The moderation models will use the DJL container.
The container downloads the model into the /tmp space on the instance because SageMaker maps the /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. It leverages s5cmd(https://github.com/peak/s5cmd) which offers a very fast download speed and hence extremely useful when downloading large models.
For instances like p4dn, which come pre-built with the volume instance, we can continue to leverage the /tmp on the container. The size of this mount is large enough to hold the model.
This step can take ~ 10 min or longer so please be patient
While you wait for the endpoint to be created, you can read more about:
Leverage the Boto3 to invoke the endpoint.
This is a generative model, so we pass in a Text as a prompt and Model will complete the sentence and return the results.
You can pass a batch of prompts as input to the model. This done by setting inputs to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters. These parameters need to be passed to the endpoint. Refer to this documentation - https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig for more details.
The below code sample illustrates the invocation of the endpoint using a prompt and also sets some parameters for inference. The function allows for a session ID to be provided for re-using previous inputs and outputs as additional context for a conversation.
Clean Up
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.