Tag And Caption Images
Copyright 2025 Google LLC.
[88]

Gemini API: Using Gemini API to tag and caption images

You will use the Gemini model's vision capabilities and the embedding model to add tags and captions to images of pieces of clothing.

These descriptions can be used alongside embeddings to allow you to search for specific pieces of clothing using natural language, or other images.

Setup

[89]
[90]

Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named GOOGLE_API_KEY. If you don't already have an API key, or you're not sure how to create a Colab Secret, see Authentication for an example.
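
A minimal sketch of loading the key, assuming the `google.colab.userdata` helper is available when running in Colab. The function name `load_api_key` and the fallback to an environment variable are illustrative assumptions, not part of the original notebook.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from a Colab Secret, else from an environment variable."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # The secret is missing or notebook access was not granted.
            key = None
        if key:
            return key

    # Assumed fallback for local runs: read the same name from the environment.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set as a Colab Secret or environment variable")
    return key
```

The returned string can then be passed to the SDK client when it is created.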

[91]

Downloading dataset

First, download a dataset of clothing images that you can use to test the model.

[92]
--2025-04-08 18:09:36--  https://storage.googleapis.com/generativeai-downloads/data/clothes-dataset.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.218.207, 142.251.31.207, 142.251.18.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.218.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 730831 (714K) [application/zip]
Saving to: ‘clothes-dataset.zip.3’

clothes-dataset.zip 100%[===================>] 713.70K  1.42MB/s    in 0.5s    

2025-04-08 18:09:37 (1.42 MB/s) - ‘clothes-dataset.zip.3’ saved [730831/730831]

Unzip clothes-dataset.zip and place its contents in a folder in your Colab environment.
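
If you prefer to stay in Python rather than call a shell command, the same step could be sketched with the standard `zipfile` module. The helper name `extract_images` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_images(zip_path, dest_dir="."):
    """Extract the archive and return the paths of the extracted .jpg files."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Keep only image entries; the archive may also list directories.
        names = [n for n in archive.namelist() if n.lower().endswith(".jpg")]
    return sorted(dest / n for n in names)
```

For example, `extract_images("clothes-dataset.zip")` would return the image paths under `clothes-dataset/`.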

[93]
Archive:  clothes-dataset.zip
  inflating: clothes-dataset/6.jpg   
  inflating: clothes-dataset/4.jpg   
  inflating: clothes-dataset/1.jpg   
  inflating: clothes-dataset/2.jpg   
  inflating: clothes-dataset/7.jpg   
  inflating: clothes-dataset/9.jpg   
  inflating: clothes-dataset/8.jpg   
  inflating: clothes-dataset/10.jpg  
  inflating: clothes-dataset/5.jpg   
  inflating: clothes-dataset/3.jpg   
[94]

Generating keywords

You can use the LLM to extract relevant keywords from the images.

Here is a helper function for calling the Gemini API with images. It sleeps after each call so that the quota is not exceeded. Refer to the pricing page for current quotas.
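
One way such a helper might look, as a sketch rather than the notebook's exact code. It assumes `client` is a `genai.Client` from the `google-genai` SDK, that `image` is something the SDK accepts as content (for example a PIL image), and that a 4-second pause is enough for your quota; the pause length is an arbitrary placeholder.

```python
import time


def generate_text_using_image(client, model_id, prompt, image, sleep_time=4):
    """Send a prompt plus one image to the model and return the text reply."""
    response = client.models.generate_content(
        model=model_id,
        contents=[prompt, image],
    )
    # Pause after the call so back-to-back calls in a loop are spaced out.
    time.sleep(sleep_time)
    return response.text
```

Passing the client in as an argument keeps the helper easy to exercise with a stub while you tune the prompt.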

[95]
MODEL_ID

First, define the list of possible keywords.

[96]

Next, define a prompt that asks the model for keywords describing the clothing. The prompt uses few-shot prompting to prime the LLM with examples of how these keywords should be generated and which ones are valid.
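
A rough sketch of assembling a few-shot prompt. The keyword list is an illustrative subset, and the text descriptions stand in for whatever examples the notebook actually uses; both are assumptions.

```python
import json

# Illustrative subset only; the notebook's full keyword list is not shown here.
KEYWORDS = ["shorts", "suit", "T-shirt", "denim", "cotton",
            "blue", "red", "men", "women", "casual", "elegant"]

# Hand-written demonstrations: (description, expected keywords).
EXAMPLES = [
    ("A pair of blue denim shorts", ["shorts", "denim", "blue"]),
    ("An elegant blue suit for men", ["suit", "men", "blue", "elegant"]),
]


def build_keyword_prompt(keywords=KEYWORDS, examples=EXAMPLES):
    """Build a few-shot prompt that restricts answers to the allowed keywords."""
    parts = [
        "Generate keywords that describe the piece of clothing in the image.",
        "Use only keywords from this list: " + ", ".join(keywords) + ".",
        "Answer with a JSON list of strings and nothing else.",
        "",
    ]
    for description, expected in examples:
        parts.append(f"Description: {description}")
        parts.append(f"Keywords: {json.dumps(expected)}")
        parts.append("")
    # End with an open slot for the model to complete.
    parts.append("Keywords:")
    return "\n".join(parts)
```

Showing the expected answers as JSON lists nudges the model toward replies that are easy to parse later.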

[97]
[98]

Generate keywords for each of the images.

[99]
Output
["shorts", "denim", "blue"]
Output
["suit", "men", "blue", "elegant"]
Output
["suit", "blue", "black", "men", "elegant"]
Output
Here are the extracted keywords:
["T-shirt", "cotton", "casual", "women", "spring", "summer", "red"]

Keyword correction and deduplication

Even when given a list of allowed keywords, the model can still return an invalid one. It may be a near-duplicate of a listed keyword, e.g. "denim" for "jeans", or be completely unrelated to any keyword on the list.

To address these issues, you can use embeddings to map the keywords to predefined ones and remove unrelated ones.
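
As a sketch, the predefined keywords could be embedded once and stored for later lookups. This assumes `client` is a `genai.Client` from the `google-genai` SDK and `model_id` names an embedding model. The helper name `embed_keywords` is hypothetical, and a plain dict is used here in place of the notebook's dataframe.

```python
def embed_keywords(client, model_id, keywords):
    """Return a dict mapping each keyword to its embedding vector."""
    keywords = list(keywords)
    response = client.models.embed_content(model=model_id, contents=keywords)
    # The response carries one embedding per input, in the same order.
    vectors = [embedding.values for embedding in response.embeddings]
    return dict(zip(keywords, vectors))
```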

[100]
EMBEDDINGS_MODEL_ID
[101]

For demonstration purposes, define a function that assesses the similarity between two embedding vectors. In this case, you will use cosine similarity, but other measures such as dot product work too.
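
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch, returning 0.0 for zero-length vectors as a convenience choice rather than a convention:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Undefined for zero vectors; treat as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.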

[102]

Next, define a function that allows you to replace a keyword with the most similar word in the keyword dataframe that you have previously created.

Note that the threshold is chosen arbitrarily and may need tuning for your use case and dataset.
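
The replacement step might be sketched as follows. The function name, the dict of keyword vectors, and the default threshold of 0.7 are all illustrative assumptions.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def replace_with_closest(vector, keyword_vectors, threshold=0.7):
    """Return the predefined keyword most similar to `vector`.

    Returns None when even the best match scores below `threshold`.
    """
    best_keyword, best_score = None, -1.0
    for keyword, candidate in keyword_vectors.items():
        score = _cosine(vector, candidate)
        if score > best_score:
            best_keyword, best_score = keyword, score
    return best_keyword if best_score >= threshold else None
```

With toy vectors `{"violet": [0.9, 0.1, 0.0], "casual": [0.0, 1.0, 0.1]}`, the query `[1.0, 0.2, 0.0]` maps to `"violet"`, while `[0.0, 0.0, 1.0]` has no match above the threshold and yields `None`. Real embeddings have many more dimensions.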

[103]

Here is an example of how these keywords can be mapped to a keyword with the closest meaning.

[104]
purple -> violet
tank top -> T-shirt
everyday -> casual

You can now either keep words that do not fit the predefined categories or delete them. In this scenario, all words without a suitable replacement are omitted.
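
Dropping unmatched words and removing duplicates can be done in one pass with a set. In this sketch, `replace` is any callable that returns a predefined keyword or `None`; that interface is an assumption.

```python
def normalize_keywords(raw_keywords, replace):
    """Map raw keywords to predefined ones, dropping those with no match."""
    mapped = (replace(keyword) for keyword in raw_keywords)
    # A set removes duplicates that arise when two raw words map to the same keyword.
    return {keyword for keyword in mapped if keyword is not None}
```

For instance, if both "everyday" and "casual" map to "casual", the result contains "casual" only once.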

[105]
{'polyester', 'women', 'white', 'sport'}
{'women', 'blue', 'casual', 'denim'}

Generating captions

[106]
Output
This is a red, short-sleeved, knee-length women's dress with a colorful floral pattern.
Output
This is a khaki button-up shirt with two chest pockets and long sleeves, designed for men.

Searching for specific clothes

Preparing the dataset

First, generate a caption and keywords for every image. Then, compute embeddings for them, which are used later to compare the images in the search dataset with other descriptions and images.

The ast.literal_eval() function safely evaluates a string containing a Python literal and returns the corresponding object. For instance, if you pass in the string "[1, 2, 3]", ast.literal_eval() returns the list [1, 2, 3]. For more information, see the Python documentation for the ast module.
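
Model replies sometimes wrap the list in extra text, as in one of the keyword outputs earlier in this notebook. A small sketch that tolerates this by evaluating only the bracketed span; the helper name `parse_keyword_list` is hypothetical.

```python
import ast


def parse_keyword_list(text):
    """Extract the bracketed list from a model reply and return it as a list of str."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no list found in model reply")
    # literal_eval only accepts Python literals, so arbitrary code is not executed.
    value = ast.literal_eval(text[start:end + 1])
    if not isinstance(value, list):
        raise ValueError("model reply did not contain a list")
    return [str(item) for item in value]
```

Calling it on `'Here are the extracted keywords:\n["T-shirt", "cotton"]'` gives `["T-shirt", "cotton"]`.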

[107]

You will use only the first 8 images, so the rest can be used for testing.
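
A sketch of the split, assuming the files are named with integers as in the unzip output above. Sorting numerically matters because a plain string sort would place `10.jpg` before `2.jpg`. The function name is hypothetical.

```python
from pathlib import Path


def split_dataset(paths, n_search=8):
    """Return (search_images, test_images), ordered by numeric file name."""
    ordered = sorted(paths, key=lambda p: int(Path(p).stem))
    return ordered[:n_search], ordered[n_search:]
```

With the ten images in this dataset, the search set holds images 1 to 8 and the test set holds images 9 and 10.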

[108]
[109]
[110]
[111]

Finding clothes using natural language
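
Conceptually, a text query is embedded and compared against the embedding stored for each image's description, and the closest matches are returned. A dependency-free sketch of the ranking step, where the `(name, vector)` pair layout and the function name are assumptions:

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def rank_by_similarity(query_vector, items, top_k=3):
    """Rank (name, vector) pairs by cosine similarity to the query.

    Returns up to `top_k` (name, score) pairs, best match first.
    """
    scored = [(name, _cosine(query_vector, vector)) for name, vector in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The same ranking applies to image-based search, provided the query image is first turned into a vector in the same embedding space.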

[112]
[113]
Output
[114]
Output

Finding similar clothes using images

[115]
[116]
OutputOutput
[117]
OutputOutput

Summary

You have used the Gemini API's Python SDK to tag and caption images of clothing. Using embedding models, you were able to search a database of images for clothing that matches your description or is similar to a provided clothing item.