SD Txt2Img
Optimized Stable Diffusion Deployments
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
This notebook will show you how to deploy an AITemplate Optimized version of Stable Diffusion which delivers 2X performance gain versus a standard version without sacrificing the quality of the generated images.
Additionally, this notebook will demonstrate how to deploy an endpoint with pagination capabilities that allow the API caller to display intermediate de-noising steps, reducing the initial latency to the sub-second range. This enhances the end-user experience by providing more immediate results and a smooth animation of the end-to-end image generation process. However, this comes at the additional compute cost of decoding intermediate latent outputs.
We have provided compiled AITemplate weights for the ml.g5 class of instances. You can compile these on your own using the instructions here.
Deploy Model
In this section we will package the model configuration and inference code and deploy it to a SageMaker Endpoint. The following are the steps to deploy the endpoint:
- Update the `serving.properties` configuration file with the location of the compiled model artifacts. More information on the supported configurations can be found here
- Package the inference code along with the configuration file into a `model.tar.gz`
- Upload the `model.tar.gz` to an S3 bucket
- Deploy the model using the `deploy_model` helper function
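The packaging portion of the steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `serving.properties` keys and the S3 artifact location shown here are assumptions, and the real configuration should follow the DJL Serving documentation linked above.

```python
import tarfile
import tempfile
from pathlib import Path

# Stage the model source directory with the configuration and inference code.
# The option.s3url value is a placeholder for the compiled AITemplate artifacts.
workdir = Path(tempfile.mkdtemp())
src = workdir / "model_src"
src.mkdir()
(src / "serving.properties").write_text(
    "engine=Python\n"
    "option.s3url=s3://my-bucket/compiled-ait-artifacts/\n"  # assumed location
)
(src / "model.py").write_text("# inference handler goes here\n")

# Package the source files into model.tar.gz for upload to S3.
archive = workdir / "model.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    for f in src.iterdir():
        tar.add(f, arcname=f.name)

with tarfile.open(archive) as tar:
    names = tar.getnames()
print(sorted(names))  # -> ['model.py', 'serving.properties']
```

The archive would then be uploaded to S3 (for example with `sagemaker.Session().upload_data`) and referenced when creating the model.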
The inference code is contained within the `model.py` file in the model source directory. We use an environment variable `PAGINATION` to indicate whether to use the standard pipeline, which returns only the final image, or a pagination-based pipeline, which returns the intermediate results of each de-noising step. The code for each pipeline is contained within its own Python module:
- pipeline_stable_diffusion_ait.py - Code for the standard pipeline
- pipeline_stable_diffusion_pagination_ait.py - Code for the paginated pipeline
The pipelines require AITemplate to be installed in the inference container. As of 4/2023, AITemplate is not available from PyPI and must be installed by building from source code as per the instructions in the git repo. For convenience, we've included a pre-compiled Python wheel `model/ait/aitemplate-0.3.dev0-py3-none-any.whl` that will be installed when the endpoint is launched.
The inference code supports both paginated and non-paginated responses, controlled by the `PAGINATION` environment variable.
Here we will deploy the endpoint without pagination by setting the environment variable to `false`.
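A small sketch of how the environment flag might be wired into a deployment. The `make_env` helper and the lowercase `"true"`/`"false"` string convention are assumptions based on the description above; the notebook's own `deploy_model` helper handles this internally.

```python
def make_env(pagination: bool) -> dict:
    """Build the container environment for the endpoint.

    Assumption: the inference code reads PAGINATION as a lowercase
    string flag, per the notebook's description.
    """
    return {"PAGINATION": "true" if pagination else "false"}

env = make_env(pagination=False)
print(env)  # {'PAGINATION': 'false'}

# With the SageMaker Python SDK, the endpoint could then be created roughly as:
# sagemaker.model.Model(image_uri=..., model_data=..., role=..., env=env) \
#     .deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```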
Enable Pagination
To enable pagination of intermediate results, we set the `PAGINATION` environment variable to `true` and redeploy the endpoint. Rather than just a single image, the paginated endpoint returns three values in its response:
- A batch of intermediate images encoded as base64-encoded JPEGs
- A safetensor value for the last latent in the generation pipeline, encoded as base64
- The last step number in the generation pipeline
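The shape of such a response can be illustrated with a mocked payload. The field names (`images`, `latent`, `step`) are assumptions for illustration; the actual keys are defined by the notebook's `model.py`, and the real latent is a serialized safetensor rather than the placeholder bytes used here.

```python
import base64
import json

# Mocked paginated response: base64 JPEG frames, a base64 latent, a step number.
fake_jpeg = b"\xff\xd8\xff\xe0fakejpegbytes"          # stand-in JPEG bytes
fake_latent = b"serialized-safetensor-bytes"           # stand-in latent tensor
response_body = json.dumps({
    "images": [base64.b64encode(fake_jpeg).decode()],
    "latent": base64.b64encode(fake_latent).decode(),
    "step": 10,
})

# Client side: decode the frames and keep the latent/step for the next call.
payload = json.loads(response_body)
images = [base64.b64decode(i) for i in payload["images"]]
latent = base64.b64decode(payload["latent"])
print(len(images), payload["step"])  # 1 10
```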
Items 2 and 3 enable the pagination. By providing a latent tensor and the step number, we can bypass the completed steps and pick up the image generation from the last completed step. Essentially, after receiving the initial batch of intermediate images, we invoke the endpoint again, this time providing the latent input and the step number. This process repeats until the specified number of de-noising steps is completed. This allows the next batch of images to be fetched concurrently while intermediate frames are displayed to the user.
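The control flow of this protocol can be followed with a stubbed endpoint, with no AWS deployment required. The field and parameter names here are assumptions; in practice `invoke_stub` would be a call to `sagemaker-runtime`'s `invoke_endpoint` carrying the base64 latent.

```python
# Pagination protocol sketch: repeatedly resume generation from the last
# completed step until all de-noising steps are done.
TOTAL_STEPS, BATCH = 20, 5

def invoke_stub(latent, step):
    """Stand-in for an endpoint invocation: returns one batch of
    intermediate images plus the state needed to resume."""
    next_step = min(step + BATCH, TOTAL_STEPS)
    images = [f"image@step{s}" for s in range(step + 1, next_step + 1)]
    return {"images": images, "latent": f"latent@{next_step}", "step": next_step}

frames, latent, step = [], None, 0
while step < TOTAL_STEPS:
    resp = invoke_stub(latent, step)          # resume from last completed step
    frames.extend(resp["images"])
    latent, step = resp["latent"], resp["step"]

print(len(frames), frames[-1])  # 20 image@step20
```

In the notebook, this loop runs in a background thread so the next batch is fetched while the current frames are displayed.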
The function below encapsulates the process for querying the pagination endpoint. It provides a Python iterator that we can iterate through to display the intermediate images. A background thread is used to fetch subsequent batches of images to simulate the concurrency aspect.
We can see from above that the first batch was delivered in under one second. This provides a more immediate response to the user at the additional compute cost of decoding intermediate images. It also doubles the time to generate the final image.
Conclusion
In this notebook we saw how we can deploy an AITemplate-optimized Stable Diffusion model, which offers a 2X performance increase without sacrificing the quality of the generated images. We also saw how we can provide a better user experience by returning intermediate results, which gives a faster initial response time and a look into the image generation process.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.