Running IF with 🧨 diffusers on a Free Tier Google Colab

TL;DR: We show how to run IF, one of the most powerful open-source text-to-image models, on a free-tier Google Colab with 🧨 diffusers.

by DeepFloyd & 🤗 HuggingFace

[Image: nabla.jpg, taken from the official IF GitHub repo]

Introduction

IF is a pixel-based text-to-image generation model released in late April 2023 by DeepFloyd. The model architecture is strongly inspired by Google's closed-source Imagen.

IF has two distinct advantages compared to existing text-to-image models like Stable Diffusion:

  • The model operates directly in "pixel space" (i.e., on uncompressed images) instead of running the denoising process in a latent space, as Stable Diffusion does.
  • The model is trained on outputs of T5-XXL, a more powerful text encoder than CLIP, which Stable Diffusion uses as its text encoder.

As a result, IF is better at generating images with high-frequency details (e.g., human faces and hands) and is the first open-source image generation model that can reliably generate images with text.

The downside of operating in pixel space and using a more powerful text encoder is that IF has a significantly higher number of parameters. T5, IF's text-to-image UNet, and IF's upscaler UNet have 4.5B, 4.3B, and 1.2B parameters respectively. Compare this to Stable Diffusion 2.1's text encoder and UNet, which have just 400M and 900M parameters respectively.

Nevertheless, it is possible to run IF on consumer hardware if one optimizes the model for low memory usage. We will show how you can do this with 🧨 diffusers in this blog post.

In 1.), we explain how to use IF for text-to-image generation and in 2.) and 3.) we go over IF's image variation and image inpainting capabilities.

💡 Note: To a large extent, we are trading speed for memory savings here to make it possible to run IF in a free-tier Google Colab. If you have access to high-end GPUs such as an A100, we recommend simply leaving all model components on the GPU for maximum speed, as done in the official IF demo.

Let's dive in 🚀!

Optimizing IF to run on memory constrained hardware

State-of-the-art ML should not just be in the hands of an elite few. Democratizing ML means making models available to run on more than just the latest and greatest hardware.

The deep learning community has created world-class tools, such as Accelerate and bitsandbytes, to run resource-intensive models on consumer hardware.

Diffusers seamlessly integrates these libraries to allow for a simple API when optimizing large models.

The free-tier Google Colab is both CPU RAM constrained (13 GB RAM) and GPU VRAM constrained (15 GB VRAM for the T4), which makes running the whole >10B parameter IF model challenging!

Let's consider the size of IF's model components in full float32 precision, where every parameter takes 4 bytes.

There is no way we can run the model in float32 as the T5 and Stage 1 UNet weights are each larger than the available CPU RAM.
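As a rough cross-check, the float32 footprint can be estimated from the parameter counts quoted in the introduction. This is only a back-of-the-envelope sketch: it assumes 4 bytes per parameter, uses the rounded parameter counts, and ignores any overhead, so the numbers will not match the checkpoint sizes on disk exactly. The helper name estimate_gb is made up for this illustration.

```python
# Back-of-the-envelope estimate: parameters x bytes per parameter.
# Parameter counts are the rounded figures from the introduction.
PARAMS = {
    "T5 text encoder": 4.5e9,
    "Stage 1 UNet": 4.3e9,
    "Stage 2 UNet": 1.2e9,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}
CPU_RAM_GB = 13  # free-tier Colab CPU RAM, as stated above


def estimate_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in PARAMS.items():
    size = estimate_gb(num_params, "float32")
    verdict = "fits" if size < CPU_RAM_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB in float32, {verdict} in {CPU_RAM_GB} GB of CPU RAM")
```

Under these assumptions, the two largest components each exceed the 13 GB of CPU RAM on their own, which is the problem described above.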

In float16, the component sizes are 11 GB, 8.6 GB, and 1.25 GB for T5, the Stage 1 UNet, and the Stage 2 UNet respectively, which is doable for the GPU, but we still run into CPU memory overflow errors when loading the T5 (some CPU RAM is occupied by other processes).

Therefore, we lower the precision of T5 even further by using bitsandbytes 8-bit quantization, which allows us to store the T5 checkpoint in as little as 8 GB.

Now that each component fits individually into both CPU and GPU memory, we need to make sure that each component has all of the CPU and GPU memory to itself when needed.

Diffusers supports modular loading of individual components, i.e., we can load the text encoder without loading the UNet. This modular loading ensures that we only load the component we need at a given step in the pipeline, so we avoid exhausting the available CPU RAM and GPU VRAM.

Let's give it a try 🚀

Available resources

The free-tier Google Colab comes with around 13 GB CPU RAM:

[ ]
MemTotal:       13297192 kB
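The cell that produced this output is not preserved here. A minimal Python sketch that reads the same value could look as follows, assuming a Linux runtime that exposes /proc/meminfo. The helper name parse_mem_total_kb is made up for this illustration.

```python
from pathlib import Path


def parse_mem_total_kb(meminfo_text: str) -> int:
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # the file only exists on Linux
    total_kb = parse_mem_total_kb(meminfo.read_text())
    print(f"MemTotal: {total_kb} kB (~{total_kb / 1024**2:.1f} GiB)")
```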

And an NVIDIA T4 with 15 GB VRAM:

[ ]
Wed Apr 26 14:25:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    27W /  70W |   8677MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Install dependencies

Some optimizations can require up-to-date versions of dependencies. If you are having issues, please double check and upgrade versions.

[ ]

Accepting the license

Before you can use IF, you need to accept its usage conditions. To do so:

    1. Make sure to have a Hugging Face account and be logged in.
    2. Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card will automatically accept it for the other IF models.
    3. Make sure to log in locally. Install huggingface_hub
[ ]

and run the login function in a Python shell

[ ]

1. Text-to-image generation

We will walk step by step through text-to-image generation with IF using Diffusers. We will briefly explain the APIs and optimizations, but more in-depth explanations can be found in the official documentation for Diffusers, Transformers, Accelerate, and bitsandbytes.

1.1 Load text encoder

We will load T5 using 8bit quantization. Transformers directly supports bitsandbytes through the load_in_8bit flag.

The flag variant="8bit" will download pre-quantized weights.

We also use the device_map flag to allow transformers to offload model layers to the CPU or disk. Transformers' big model loading supports arbitrary device maps, which can be used to load model parameters directly onto the available devices. Passing "auto" will automatically create a device map. See the transformers docs for more information.

[ ]
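A minimal sketch of the loading step described above. It only uses the flags named in the text (load_in_8bit, variant="8bit", device_map="auto"); the subfolder name is taken from the text_encoder/ file paths that appear in the download logs of this notebook. The function names are made up for this illustration, and the heavy import is deferred so that defining the helper does not require a GPU. Depending on the installed Transformers version, 8-bit loading may need to be configured through a quantization config object instead of the load_in_8bit flag.

```python
MODEL_ID = "DeepFloyd/IF-I-XL-v1.0"


def text_encoder_kwargs() -> dict:
    """Keyword arguments for loading T5 as described in the text above."""
    return {
        "subfolder": "text_encoder",  # T5 lives in this subfolder of the IF repo
        "device_map": "auto",         # let the library place layers automatically
        "load_in_8bit": True,         # bitsandbytes 8-bit quantization
        "variant": "8bit",            # download the pre-quantized weights
    }


def load_text_encoder():
    """Load the 8-bit T5 encoder (needs transformers, accelerate, bitsandbytes, a GPU)."""
    from transformers import T5EncoderModel  # imported lazily: heavy dependency

    return T5EncoderModel.from_pretrained(MODEL_ID, **text_encoder_kwargs())
```

Calling text_encoder = load_text_encoder() triggers the download and requires that the license has been accepted as described earlier.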

1.2 Create text embeddings

The Diffusers API for accessing diffusion models is the DiffusionPipeline class and its subclasses. Each instance of DiffusionPipeline is a fully self-contained set of methods and models for running diffusion networks. We can override the models it uses by passing alternative instances as keyword arguments to from_pretrained.

In this case, we pass None for the unet argument so no UNet will be loaded. This allows us to run the text embedding portion of the diffusion process without loading the UNet into memory.

[ ]

IF also comes with a super resolution pipeline. We will save the prompt embeddings so that we can later directly pass them to the super resolution pipeline. This will allow the super resolution pipeline to be loaded without a text encoder.

Instead of an astronaut just riding a horse, let's hand him a sign as well!

Let's define a fitting prompt:

[ ]

and run it through the 8bit quantized T5 model:

[ ]
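Putting sections 1.1 and 1.2 together, a sketch of the embedding step might look like this. It assumes text_encoder is the 8-bit T5 loaded in section 1.1 and that the pipeline exposes an encode_prompt method returning the positive and negative embeddings. The function name and the example prompt are made up for this illustration, and device_map="auto" follows the text of this notebook (whether a pipeline accepts that value depends on the diffusers version).

```python
# Hypothetical prompt for illustration; the notebook defines its own.
EXAMPLE_PROMPT = "an astronaut riding a horse, holding a sign"


def encode_prompt(text_encoder, prompt, model_id="DeepFloyd/IF-I-XL-v1.0"):
    """Return (prompt_embeds, negative_embeds) without ever loading the UNet."""
    from diffusers import DiffusionPipeline  # imported lazily: heavy dependency

    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        text_encoder=text_encoder,  # reuse the already loaded 8-bit T5
        unet=None,                  # skip the UNet entirely
        device_map="auto",
    )
    prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
    return prompt_embeds, negative_embeds
```

The two returned tensors are the only things that need to survive this step; the text encoder and the pipeline can be discarded afterwards.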

1.3 Free memory

Once the prompt embeddings have been created, we do not need the text encoder anymore. However, it is still in memory on the GPU. We need to remove it so that we can load the UNet.

It's non-trivial to free PyTorch memory. We must garbage collect the Python objects which point to the actual memory allocated on the GPU.

First, use the Python keyword del to delete all Python objects referencing allocated GPU memory.

[ ]

Deleting the Python objects is not enough to free the GPU memory. The actual GPU memory is freed only when the objects are garbage collected.

Additionally, we will call torch.cuda.empty_cache(). This call isn't strictly necessary, as the cached CUDA memory is immediately available for further allocations. Emptying the cache allows us to verify in the Colab UI that the memory is available.

We'll use a helper function flush() to flush memory.

[ ]
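A sketch of such a helper, following the two steps described above (garbage collection, then torch.cuda.empty_cache()). The PyTorch import is made optional here only so that the example also runs on a machine without PyTorch. FakeModel is a made-up stand-in used to demonstrate that del alone does not release an object that is still referenced elsewhere.

```python
import gc
import weakref


def flush() -> None:
    """Run Python garbage collection, then release cached CUDA memory if possible."""
    gc.collect()
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


class FakeModel:
    """Stand-in for a model; the self-reference forms a reference cycle."""

    def __init__(self):
        self.me = self  # only the garbage collector can free a cycle


model = FakeModel()
probe = weakref.ref(model)  # lets us observe whether the object still exists
del model                   # removes the name, but the cycle keeps the object alive
flush()                     # garbage collection actually frees it
print("freed:", probe() is None)
```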

and run it

[ ]

1.4 Stage 1: The main diffusion process

With our now available GPU memory, we can re-load the DiffusionPipeline with only the UNet to run the main diffusion process.

The variant and torch_dtype flags are used by Diffusers to download and load the weights in 16 bit floating point format.

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.

Often, we directly pass the text prompt to DiffusionPipeline.__call__. However, we previously computed our text embeddings which we can pass instead.

IF also comes with a super resolution diffusion process. Setting output_type="pt" will return raw PyTorch tensors instead of a PIL image. This way we can keep the PyTorch tensors on the GPU and pass them directly to the stage 2 super resolution pipeline.

Let's define a random generator and run the stage 1 diffusion process.

[ ]
  0%|          | 0/100 [00:00<?, ?it/s]
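A sketch of the stage 1 step as described above: load the pipeline without a text encoder in float16, then pass the pre-computed embeddings and request raw tensors. The function name and the seed value are made up for this illustration, and device_map="auto" again follows the text of this notebook rather than any particular diffusers version.

```python
def run_stage_1(prompt_embeds, negative_embeds, seed=1):
    """Load only the stage 1 UNet in float16 and return (pipe, image tensors)."""
    import torch                             # imported lazily: heavy dependencies
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0",
        text_encoder=None,          # embeddings are already computed
        variant="fp16",             # download the 16-bit weight files
        torch_dtype=torch.float16,  # and load them as float16
        device_map="auto",
    )
    generator = torch.Generator().manual_seed(seed)  # reproducible sampling
    images = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        output_type="pt",           # keep raw tensors for stage 2
        generator=generator,
    ).images
    return pipe, images
```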

Let's manually convert the raw tensors to PIL and have a sneak peek at the final result. The output of stage 1 is a 64x64 image.

[ ]
Output
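The value mapping behind such a conversion can be sketched in plain Python. This assumes the raw tensors hold pixel values in the range [-1, 1], which is an assumption about the pipeline output rather than something stated in this notebook. to_uint8 is a made-up helper that shows the arithmetic for a single value; a real conversion would apply it to the whole tensor at once.

```python
def to_uint8(value: float) -> int:
    """Map a pixel value from [-1, 1] to an integer in [0, 255]."""
    scaled = value / 2 + 0.5              # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # guard against out-of-range values
    return round(clamped * 255)           # [0, 1] -> [0, 255]


print([to_uint8(v) for v in (-1.0, -0.5, 0.5, 1.0, 1.7)])
```

With diffusers installed, a helper such as pt_to_pil from diffusers.utils is expected to perform an equivalent conversion on full image batches.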

And again, we remove the Python pointer and free CPU and GPU memory:

[ ]

1.5 Stage 2: Super Resolution 64x64 to 256x256

IF comes with a separate diffusion process for upscaling.

We run each diffusion process with a separate pipeline.

The super resolution pipeline can be loaded with a text encoder if needed. However, we will usually have pre-computed text embeddings from the first IF pipeline. If so, load the pipeline without the text encoder.

Create the pipeline

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

and run it, re-using the pre-computed text embeddings

[ ]
  0%|          | 0/50 [00:00<?, ?it/s]
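A sketch of the stage 2 step. The checkpoint name below is an assumption (it does not appear in the text of this notebook), as is the function name. The pipeline is loaded without a text encoder and receives the stage 1 tensors together with the pre-computed embeddings.

```python
# Assumed stage 2 checkpoint name; adjust if your setup uses a different one.
STAGE_2_ID = "DeepFloyd/IF-II-L-v1.0"


def run_stage_2(image, prompt_embeds, negative_embeds, generator=None):
    """Upscale stage 1 tensors from 64x64 to 256x256 and return raw tensors."""
    import torch                             # imported lazily: heavy dependencies
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        STAGE_2_ID,
        text_encoder=None,          # no text encoder needed: memory savings
        variant="fp16",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe(
        image=image,                # output of stage 1
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        output_type="pt",           # keep tensors for the next stage
        generator=generator,
    ).images
```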

Again we can inspect the intermediate results.

[ ]
Output

And again, we delete the Python pointer and free memory

[ ]

1.6 Stage 3: Super Resolution 256x256 to 1024x1024

The second super resolution model for IF is Stability AI's previously released x4 Upscaler.

Let's create the pipeline and load it directly on GPU with device_map="auto".

Note that device_map="auto" will throw errors with certain pipelines in diffusers versions v0.16 to v0.17. You will either have to upgrade to a later version, if one exists, or install from main with pip install git+https://github.com/huggingface/diffusers.git@main

[ ]

🧨 diffusers makes independently developed diffusion models easily composable, as pipelines can be chained together. Here we can just take the previous PyTorch tensor output and pass it to the stage 3 pipeline as image=image.

💡 Note: The x4 Upscaler does not use T5 and has its own text encoder. Therefore, we cannot use the previously created prompt embeddings and instead must pass the original prompt.

[ ]
  0%|          | 0/75 [00:00<?, ?it/s]
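A sketch of the stage 3 step. The checkpoint name is an assumption (the text only names the model as Stability AI's x4 Upscaler), and the function name is made up. Note that the original text prompt is passed instead of the T5 embeddings, as explained in the note above.

```python
# Assumed Hub name of the x4 upscaler checkpoint.
UPSCALER_ID = "stabilityai/stable-diffusion-x4-upscaler"


def run_stage_3(image, prompt, generator=None):
    """Upscale stage 2 tensors from 256x256 to 1024x1024 and return PIL images."""
    import torch                             # imported lazily: heavy dependencies
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        UPSCALER_ID,
        torch_dtype=torch.float16,
        device_map="auto",  # as described in the text; version dependent
    )
    # The upscaler has its own text encoder, so it needs the raw prompt.
    return pipe(prompt, image=image, generator=generator).images
```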

Unlike the IF pipelines, the IF watermark will not be added by default to outputs from the Stable Diffusion x4 upscaler pipeline.

We can instead manually apply the watermark.

[ ]
[<PIL.Image.Image image mode=RGB size=1024x1024 at 0x7F7B899CF910>]

View output image

[ ]
Output

Et voila! A beautiful 1024x1024 image in a free-tier Google Colab.

We have shown how 🧨 diffusers makes it easy to decompose and modularly load resource intensive diffusion models.

💡 Note: We don't recommend using the above setup in production. 8bit quantization, manual de-allocation of model weights, and disk offloading all trade off memory for time (i.e., inference speed). This can be especially noticeable if the diffusion pipeline is re-used. In production, we recommend using a 40GB A100 with all model components left on the GPU. See the official IF demo.

2. Image variation

The same IF checkpoints can also be used for text-guided image variation and inpainting. The core diffusion process is the same as for text-to-image generation, except that the initial noised image is created from the image to be varied or inpainted.

To run image variation, load the same checkpoints with IFImg2ImgPipeline.from_pretrained() and IFImg2ImgSuperResolutionPipeline.from_pretrained().

The APIs for memory optimization are all the same!

Let's free the memory from the previous section.

[ ]
[ ]

For image variation, we start with an initial image that we want to adapt.

For this section, we will adapt the famous "Slaps Roof of Car" meme. Let's download it from the internet.

[ ]

and load it into a PIL Image

[ ]
Output

The image variation pipeline takes both PIL images and raw tensors. See the pipeline docstrings for more in-depth documentation on the expected inputs.

2.1 Text Encoder

Image variation is guided by text, so we can define a prompt and encode it with T5's Text Encoder.

Again we load the text encoder into 8bit precision.

[ ]
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

For image variation, we load the checkpoint with IFImg2ImgPipeline. When using DiffusionPipeline.from_pretrained(...), checkpoints are loaded into their default pipeline. The default pipeline for IF is the text-to-image IFPipeline. When loading checkpoints with a non-default pipeline, the pipeline must be explicitly specified.

[ ]

Let's turn our salesman into an anime character.

[ ]

As before, we create the text embeddings with T5

[ ]

and free GPU and CPU memory.

First remove the Python pointers

[ ]

and then free the memory

[ ]

2.2 Stage 1: The main diffusion process

Next, we only load the stage 1 UNet weights into the pipeline object, just like we did in the previous section.

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.

The image variation pipeline requires both the original image and the prompt embeddings.

We can optionally use the strength argument to configure the amount of variation. strength directly controls the amount of noise added. Higher strength means more noise which means more variation.

[ ]
  0%|          | 0/56 [00:00<?, ?it/s]
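One way to build intuition for strength is to look at how it can shorten the denoising schedule. The sketch below assumes the pipeline runs only the last strength fraction of the requested steps, which is how img2img pipelines in diffusers commonly behave. The concrete numbers are assumptions chosen for illustration, not confirmed defaults: 80 requested steps at a strength of 0.7 would give the 56 steps shown in the progress bar above.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run for a given strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Only the final fraction of the schedule is executed.
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.25, 0.5, 0.7, 1.0):
    print(f"strength={strength}: {effective_steps(80, strength)} of 80 steps")
```

A strength of 1.0 runs the full schedule, which corresponds to starting from (almost) pure noise and therefore to the largest amount of variation.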

Let's check the intermediate 64x64 again.

[ ]
Output

Looks good, we can free the memory and upscale the image again.

[ ]
[ ]

2.3 Stage 2: Super Resolution

For super resolution, load the same stage 2 checkpoint as before, this time with IFImg2ImgSuperResolutionPipeline.

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

💡 Note: The image variation super resolution pipeline requires the generated image as well as the original image.

You can also use the Stable Diffusion x4 upscaler on this image. Feel free to try it out using the code snippets in section 1.6.

[ ]
  0%|          | 0/40 [00:00<?, ?it/s]
Output

Nice! Let's free the memory and look at the final inpainting pipelines.

[ ]
[ ]

3. Inpainting

The IF inpainting pipeline is the same as the image variation pipeline, except that only a selected area of the image is denoised.

We specify the area to inpaint with an image mask.

Let's show off IF's amazing "letter generation" capabilities. We can replace this sign text with a different slogan.

First let's download the image

[ ]

and turn it into a PIL Image

[ ]
Output

We will mask the sign so we can replace its text.

For convenience, we have pre-generated the mask and uploaded it to a HF dataset.

Let's download it.

[ ]
Downloading sign_man_mask.png:   0%|          | 0.00/1.22k [00:00<?, ?B/s]
Output

💡 Note: You can create masks yourself by manually creating a greyscale image.

[ ]
Output
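A sketch of how such a mask could be built by hand. It assumes that white pixels (255) mark the area to repaint and black pixels (0) are preserved. The helper names, image size, and box coordinates are made up for this illustration. The grid is built in plain Python, and the conversion to a single-channel PIL image is kept in a separate helper that requires Pillow.

```python
def make_mask(width: int, height: int, box: tuple) -> list:
    """Return a height x width grid: 255 inside box (left, top, right, bottom), else 0."""
    left, top, right, bottom = box
    return [
        [255 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


def to_pil(mask_rows):
    """Convert the grid to a single-channel ("L") greyscale PIL image."""
    from PIL import Image  # imported lazily: requires Pillow

    height, width = len(mask_rows), len(mask_rows[0])
    data = bytes(value for row in mask_rows for value in row)
    return Image.frombytes("L", (width, height), data)


mask = make_mask(64, 64, box=(30, 20, 40, 30))
masked = sum(value == 255 for row in mask for value in row)
print(f"{masked} of {64 * 64} pixels are masked")
```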

Now we can start inpainting 🎨🖌

3.1 Text Encoder

Again, we load the text encoder first

[ ]
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

This time, we initialize the IFInpaintingPipeline with the text encoder weights.

[ ]

Alright, let's have the man advertise for more layers instead.

[ ]

Having defined the prompt, we can create the prompt embeddings

[ ]

Just like before we free the memory

[ ]
[ ]

3.2 Stage 1: The main diffusion process

Just like before we now load the stage 1 pipeline with only the UNet.

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.

Now, we need to pass the input image, the mask image, and the prompt embeddings.

[ ]
  0%|          | 0/50 [00:00<?, ?it/s]
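A sketch of the inpainting stage 1 call with the three inputs named above. The function name is made up for this illustration, and the loading flags mirror the ones used for text-to-image generation in section 1.4. original_image and mask_image are expected to be the PIL images loaded earlier in this section.

```python
def run_inpainting_stage_1(original_image, mask_image, prompt_embeds,
                           negative_embeds, generator=None):
    """Inpaint the masked area at 64x64 resolution and return raw tensors."""
    import torch                            # imported lazily: heavy dependencies
    from diffusers import IFInpaintingPipeline

    pipe = IFInpaintingPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0",
        text_encoder=None,          # embeddings are already computed
        variant="fp16",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return pipe(
        image=original_image,       # the image to edit
        mask_image=mask_image,      # the area to repaint
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        output_type="pt",           # keep tensors for super resolution
        generator=generator,
    ).images
```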

Let's take a look at the intermediate output.

[ ]
Output

Looks good! The text is pretty consistent!

Let's free the memory so we can upscale the image

[ ]
[ ]

3.3 Stage 2: Super Resolution

For super resolution, load the checkpoint with IFInpaintingSuperResolutionPipeline.

[ ]

A mixture of fp16 and non-fp16 filenames will be loaded.
Loaded fp16 filenames:
[unet/diffusion_pytorch_model.fp16.safetensors, text_encoder/model.fp16-00001-of-00002.safetensors, text_encoder/model.fp16-00002-of-00002.safetensors, safety_checker/model.fp16.safetensors]
Loaded non-fp16 filenames:
[watermarker/diffusion_pytorch_model.safetensors
If this behavior is not expected, please check your folder structure.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

The inpainting super resolution pipeline requires the generated image, the original image, the mask image, and the prompt embeddings.

Let's do a final denoising run.

[ ]
  0%|          | 0/80 [00:00<?, ?it/s]
Output

Nice, the model managed to generate text without making a single spelling error!

Conclusion

IF in 32-bit floating point precision uses 40 GB of weights in total. We showed how, using only open-source models and libraries, IF can be run on a free-tier Google Colab instance.

The ML ecosystem benefits deeply from the sharing of open tools and open models. This notebook alone used models from DeepFloyd, StabilityAI, and LAION. The libraries used -- Diffusers, Transformers, Accelerate, and bitsandbytes -- all benefit from countless contributors from different organizations.

A massive thank you to the DeepFloyd team for the creation and open sourcing of IF, and for contributing to the democratization of good machine learning 🤗.