Multimodal Search And Conditional Image Generation


Introduction

In this notebook we demonstrate how to implement text-to-image and image-to-image search to retrieve semantically relevant images. We then use the retrieved images to condition the generation of new images with diffusion models.

We will cover:

  1. How we can use multimodal embedding models like JinaCLIP to perform multimodal search.
  2. How we can perform conditional image generation using the FLUX models.

Install relevant libraries

Example image links returned by a search:

['http://leapphotography.com/blog/wp-content/uploads/2017/04/family-portrait-studio-boise-idaho-003.jpg',
 'http://3.bp.blogspot.com/_b_LWsdjxDUI/TKJuV4BG7EI/AAAAAAAAEkM/qcdcqXqtZLc/s1600/Family+PIctures_53.jpg',
 'https://www.uniqueideas.site/wp-content/uploads/img_6349-1087x1600-pixels-family-group-photos-pinterest-1.jpg']

We use the following code to obtain a variety of image links we can index.

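from time import sleep

# Let's create a small dataset of 12 images covering diverse topics:
# 4 search terms x 3 images each.
# NOTE: search_images is assumed to be defined in an earlier cell and to
# return a list of image URLs for a search term.
searches = ('forest', 'dog', 'strawberry field', 'family picture')

links = []
for term in searches:
    links += search_images(term, max_images=3)
    sleep(1)  # brief pause between searches to avoid overloading the search service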

Embed Images using JinaCLIP

JinaCLIP is an embedding model that we will use to generate vector embeddings for our 12 images above. It is trained using contrastive learning to unify text and image representations.

During contrastive training, text and image pairs that are semantically similar are pulled closer together in vector space, and those that are dissimilar are pushed apart.

This model allows us to perform retrieval over image and text modalities.
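To make the contrastive idea concrete, here is a minimal NumPy sketch of a symmetric CLIP-style contrastive (InfoNCE) loss. It is a generic illustration, not JinaCLIP's actual training recipe; the function names, the toy data, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of text_emb is paired with row i of image_emb."""
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature          # (n, n) text-to-image similarities
    idx = np.arange(len(logits))
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # each text picks its image
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # each image picks its text
    return (loss_t2i + loss_i2t) / 2

# Toy data (hypothetical): 4 "text" vectors and nearly identical "image" vectors.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))
shuffled = aligned[[1, 2, 3, 0]]            # break the pairing on purpose

loss_aligned = contrastive_loss(text, aligned)
loss_shuffled = contrastive_loss(text, shuffled)
print(loss_aligned < loss_shuffled)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls matching text and image vectors together during training.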

image_embeddings is now a NumPy array of shape (12, 768): one 768-dimensional vector for each of our 12 images. This array serves as our vector index.
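The indexing step can be sketched as follows. This is a minimal illustration under stated assumptions: `build_index` and `fake_encode` are hypothetical names, and the random stand-in encoder exists only so the sketch runs without downloading a model. In the notebook, the encoder would be the JinaCLIP model's image-encoding call. Normalizing the rows is a design choice made here so that later dot products equal cosine similarities.

```python
import numpy as np

def build_index(items, encode_fn):
    """Embed every item and stack the vectors into an (n_items, dim) array."""
    vectors = np.asarray(encode_fn(items), dtype=np.float32)
    if vectors.ndim != 2 or len(vectors) != len(items):
        raise ValueError("encode_fn must return one vector per item")
    # L2-normalize rows (assumption: cosine similarity is the retrieval metric).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def fake_encode(items, dim=768):
    # Stand-in for a real image encoder: returns random vectors of the right shape.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(items), dim))

# Placeholder URLs, not the links gathered above.
demo_links = [f"https://example.com/img_{i}.jpg" for i in range(12)]
demo_index = build_index(demo_links, fake_encode)
print(demo_index.shape)  # (12, 768)
```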

Image Retrieval Function

Below we implement a retrieval function that will embed an image or text query and return the most semantically relevant image.

Since JinaCLIP is a multimodal model, it can accept either text or images as input, so our function needs to handle both text and image queries.
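A retrieval function of this kind might look like the sketch below. It is not the notebook's original implementation: `retrieve`, `looks_like_image`, and the suffix-based heuristic for telling image queries from text queries are assumptions, and the two encoders are passed in as callables so that the toy encoders here can be swapped for the model's text and image encoders.

```python
import numpy as np

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".webp")

def looks_like_image(query):
    # Heuristic (an assumption): strings ending in an image suffix are image queries.
    return query.lower().split("?")[0].endswith(IMAGE_SUFFIXES)

def retrieve(query, index, encode_text, encode_image, top_k=1):
    """Return (indices, cosine scores) of the top_k index rows closest to the query."""
    encode = encode_image if looks_like_image(query) else encode_text
    q = np.asarray(encode([query]), dtype=np.float32)[0]
    q = q / np.linalg.norm(q)
    rows = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = rows @ q                      # cosine similarity to every indexed image
    best = np.argsort(-scores)[:top_k]     # highest similarity first
    return best, scores[best]

# Toy 3-image index and toy encoders (hypothetical stand-ins for the real model).
toy_index = np.eye(3, dtype=np.float32)
toy_text_encoder = lambda texts: [[0.9, 0.1, 0.0] for _ in texts]
toy_image_encoder = lambda images: [[0.0, 0.2, 1.0] for _ in images]

text_hits, _ = retrieve("family pics", toy_index, toy_text_encoder, toy_image_encoder)
image_hits, _ = retrieve("https://example.com/dog.jpg", toy_index,
                         toy_text_encoder, toy_image_encoder)
print(text_hits, image_hits)
```

Because text and image embeddings live in the same vector space, the same similarity search serves both query types; only the choice of encoder differs.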


Below we perform text-to-image retrieval:


This image is the most semantically similar to the text query: family pics


Conditional Image Generation Using Diffusion Models

We will use the image retrieved above to condition the generation of a holiday-card cartoon version of it!


Image to Image Search and Conditional Generation

Next we will use an image as the query and then use the most semantically relevant retrieved image to generate another holiday cartoon image!


Check out how you can generate images conditioned on input images here!