Copyright 2025 Google LLC.

Anomaly detection with embeddings

⚠️ This notebook requires paid tier rate limits to run properly (cf. pricing for more details).

Overview

This tutorial demonstrates how to use embeddings from the Gemini API to detect potential outliers in your dataset. You will visualize a subset of the 20 Newsgroups dataset using t-SNE and flag as outliers the points that fall outside a chosen radius around the central point of each category's cluster.

For more information on getting started with embeddings generated from the Gemini API, check out the Get Started guide.

Prerequisites

You can run this quickstart in Google Colab.

To complete this quickstart in your own development environment, ensure that your environment meets the following requirements:

  • Python 3.11+
  • An installation of jupyter to run the notebook.

Setup

First, download and install the Gemini API Python library.

Grab an API Key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name GEMINI_API_KEY.

Once you have the API key, pass it to the SDK. You can do this in two ways:

  • Put the key in the GEMINI_API_KEY environment variable (the SDK will automatically pick it up from there).
  • Pass the key to genai.Client(api_key=...)
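
Either option can be wrapped in a small helper. The following is a minimal sketch, assuming the `google-genai` package is installed (for example with `pip install -U google-genai`). `resolve_api_key` and `make_client` are hypothetical helper names used for illustration; they are not part of the SDK.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return the explicit key if given, else GEMINI_API_KEY from the environment."""
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY or pass an API key explicitly.")
    return key


def make_client(explicit_key=None):
    """Create a Gemini API client; the SDK is imported only when this is called."""
    from google import genai  # provided by the google-genai package

    return genai.Client(api_key=resolve_api_key(explicit_key))
```

In Colab, the stored secret can be read with `userdata.get("GEMINI_API_KEY")` from `google.colab` and passed to `make_client`.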

Key Point: Next, you will choose a model. Any embedding model will work for this tutorial, but for real applications it's important to choose a specific model and stick with it. The outputs of different models are not compatible with each other.
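
One way to find candidate models is to filter the model list by supported action. The sketch below assumes each model object exposes `name` and `supported_actions` attributes, as the objects returned by `client.models.list()` in the `google-genai` SDK appear to. `embedding_model_names` is a hypothetical helper, and the stand-in objects (including the made-up `models/some-chat-model`) exist only so the example runs without an API key.

```python
from types import SimpleNamespace


def embedding_model_names(models):
    """Return the names of models that advertise the embedContent action."""
    return [
        m.name
        for m in models
        if "embedContent" in (getattr(m, "supported_actions", None) or [])
    ]


# With a real client this would be: embedding_model_names(client.models.list())
# The stand-in objects below only mimic the two attributes used above.
demo_models = [
    SimpleNamespace(name="models/gemini-embedding-001", supported_actions=["embedContent"]),
    SimpleNamespace(name="models/some-chat-model", supported_actions=["generateContent"]),
]
print(embedding_model_names(demo_models))
```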

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp
models/gemini-embedding-001

Select the model to be used

MODEL_ID

Prepare the dataset

The 20 Newsgroups Text Dataset contains 18,000 newsgroup posts on 20 topics, divided into training and test sets. The split between the training and test datasets is based on messages posted before and after a specific date. This tutorial uses the training subset.
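
A minimal sketch of loading the training subset with scikit-learn's `fetch_20newsgroups` and arranging it as rows. The helper names and the `Text` and `Label` keys are assumptions made for illustration; only `Class Name` appears in the outputs of this notebook.

```python
def load_training_split():
    """Download (on first use) and return the 20 Newsgroups training subset."""
    from sklearn.datasets import fetch_20newsgroups  # requires scikit-learn

    return fetch_20newsgroups(subset="train")


def to_records(texts, targets, target_names):
    """Pair each post with its integer label and human-readable class name."""
    return [
        {"Text": text, "Label": int(label), "Class Name": target_names[int(label)]}
        for text, label in zip(texts, targets)
    ]


# Typical use (needs network access the first time the dataset is fetched):
#   bunch = load_training_split()
#   records = to_records(bunch.data, bunch.target, bunch.target_names)
```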

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Here is the first example in the training set.

Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----






Next, sample some of the data by taking 150 data points from each category in the training dataset and keeping only a few categories. This tutorial uses the science categories.
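
The sampling step can be sketched with pandas as follows. This assumes a dataframe with a `Class Name` column and uses `DataFrameGroupBy.sample`, which is not necessarily what the original cell did; `sample_science_posts` and its `seed` parameter are hypothetical.

```python
import pandas as pd

SAMPLE_SIZE = 150  # points per category, as described in the text


def sample_science_posts(df, sample_size=SAMPLE_SIZE, seed=0):
    """Keep the 'sci.*' classes, then draw `sample_size` rows from each of them."""
    science = df[df["Class Name"].str.startswith("sci.")]
    # Raises ValueError if any class has fewer than `sample_size` rows.
    sampled = science.groupby("Class Name").sample(n=sample_size, random_state=seed)
    return sampled.reset_index(drop=True)
```

Filtering before sampling avoids drawing rows from categories that are discarded anyway.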

Class Name
sci.crypt          150
sci.electronics    150
sci.med            150
sci.space          150
Name: count, dtype: int64

Generate the embeddings

In this section, you will generate an embedding for each text in the dataframe using the Gemini API.

The Gemini embedding model supports several task types, each tailored for a specific goal. Here’s a general overview of the available types and their applications:

Task Type            Description
RETRIEVAL_QUERY      Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT   Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY  Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION       Specifies that the embeddings will be used for classification.
CLUSTERING           Specifies that the embeddings will be used for clustering.
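
A minimal sketch of requesting one embedding per text with a chosen task type. It assumes a `google-genai` client whose `client.models.embed_content` accepts the `config` as a plain dict and returns a response with `embeddings[0].values`. `embed_texts` is a hypothetical helper, and `CLUSTERING` is picked here only because it seems to fit grouping documents by topic.

```python
def embed_texts(client, texts, model_id, task_type="CLUSTERING"):
    """Embed each text with the given task type and return one vector per text."""
    vectors = []
    for text in texts:
        response = client.models.embed_content(
            model=model_id,
            contents=text,
            config={"task_type": task_type},  # assumed dict form of the embed config
        )
        vectors.append(list(response.embeddings[0].values))
    return vectors
```

This sends one request per text, so a few hundred documents can take several minutes and may run into rate limits.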

Dimensionality reduction

The dimension of the document embedding vector is 3072. In order to visualize how the embedded documents are grouped together, you will need to apply dimensionality reduction as you can only visualize the embeddings in 2D or 3D space. Contextually similar documents should be closer together in space as opposed to documents that are not as similar.

3072
(600, 3072)

You will apply the t-Distributed Stochastic Neighbor Embedding (t-SNE) approach to perform dimensionality reduction. This technique reduces the number of dimensions, while preserving clusters (points that are close together stay close together). For the original data, the model tries to construct a distribution over which other data points are "neighbors" (e.g., they share a similar meaning). It then optimizes an objective function to keep a similar distribution in the visualization.
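
The reduction step can be sketched as follows, assuming the embeddings are available as a sequence of equal-length vectors and that scikit-learn is installed. `stack_embeddings` and `reduce_to_2d` are hypothetical names, and the perplexity cap is an illustrative choice rather than a value taken from this notebook.

```python
import numpy as np


def stack_embeddings(embeddings):
    """Turn a sequence of equal-length vectors into an (n_samples, n_dims) array."""
    X = np.asarray(list(embeddings), dtype=np.float32)
    if X.ndim != 2:
        raise ValueError("expected a sequence of equal-length embedding vectors")
    return X


def reduce_to_2d(X, seed=0):
    """Project the rows of X to 2D with t-SNE; scikit-learn is imported lazily."""
    from sklearn.manifold import TSNE

    # t-SNE requires perplexity < n_samples, so cap it for small inputs.
    perplexity = min(30.0, max(1.0, (X.shape[0] - 1) / 3))
    tsne = TSNE(n_components=2, random_state=seed, perplexity=perplexity)
    return tsne.fit_transform(X)
```

For the 600 sampled posts, `stack_embeddings` would yield a `(600, 3072)` array and `reduce_to_2d` a `(600, 2)` array suitable for a scatter plot.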

[Figure: scatter plot of the embeddings after t-SNE; y-axis label: TSNE2.]

Outlier detection

To decide which points are anomalous, you will separate inliers from outliers. Start by finding the centroid, the location that represents the center of a cluster, and then use each point's distance from its centroid to determine which points are outliers.

Start by getting the centroid of each category.
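
A centroid is simply the mean of the vectors in a category. The sketch below uses NumPy and works on either the full embeddings or the 2D coordinates; `centroids_by_label` is a hypothetical helper, not the notebook's original function.

```python
import numpy as np


def centroids_by_label(X, labels):
    """Return {label: mean vector} for the rows of X grouped by label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {
        label: X[labels == label].mean(axis=0)
        for label in np.unique(labels).tolist()
    }
```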

Plot each centroid you have found against the rest of the points.

Choose a radius. Anything beyond this bound from the centroid of that category is considered an outlier.
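
The radius test can be sketched as a Euclidean distance check against each point's own category centroid. `outlier_mask` is a hypothetical helper, and `centroids` is assumed to be a mapping from label to centroid vector.

```python
import numpy as np


def outlier_mask(X, labels, centroids, radius):
    """True where a point lies farther than `radius` from its category centroid."""
    X = np.asarray(X, dtype=float)
    # Look up the centroid that belongs to each row's own category.
    own_centroid = np.stack([np.asarray(centroids[label], dtype=float) for label in labels])
    distances = np.linalg.norm(X - own_centroid, axis=1)
    return distances > radius
```

Points exactly on the boundary count as inliers here because the comparison is strict.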

Depending on how sensitive you want your anomaly detector to be, you can choose which radius you would like to use. For now, 0.58 is used, but you can change this value.
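
To see how sensitive the detector is to this choice, you can count the outliers for several candidate radii. The sketch below assumes the distances from each point to its category centroid have already been computed; `outlier_counts` is a hypothetical helper.

```python
import numpy as np


def outlier_counts(distances, radii):
    """For each candidate radius, count how many distances exceed it."""
    distances = np.asarray(distances, dtype=float)
    return {float(r): int((distances > r).sum()) for r in radii}
```

A smaller radius flags more points, so the counts shrink as the radius grows.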

Plot the outliers and denote them using a transparent red color.

[Figure: scatter plot with the outliers marked in transparent red; y-axis label: TSNE2.]

Use the index values of the dataframes to print a few examples of what outliers can look like in each category. Here, the first data point from each category is printed. Explore other points in each category to see more data deemed to be outliers, or anomalies.

Electric power line "balls"
Article-I.D.: almaden.19930406.142616.248
Lines: 4

Power lines and airplanes don't mix. In areas where lines are strung very
high, or where a lot of crop dusting takes place, or where there is danger
of airplanes flying into the lines, they place these plastic balls on the
lines so they are easier to spot.

LARSONIAN Astronomy and Physics
Organization: University of Wisconsin Eau Claire
Lines: 552



                      LARSONIAN Astronomy and Physics

               Orthodox physicists, astronomers, and astrophysicists 
          CLAIM to be looking for a "Unified Field Theory" in which all 
          of the forces of the universe can be explained with a single 
          set of laws or equations.  But they have been systematically 
          IGNORING or SUPPRESSING an excellent one for 30 years! 

               The late Physicist Dewey B. Larson's comprehensive 
          GENERAL UNIFIED Theory of the physical universe, which he 
          calls the "Reciprocal System", is built on two fundamental 
          postulates about the physical and mathematical natures of 
          space and time: 
    
                "The physical universe is composed ENTIRELY of ONE 
          component, MOTION, existing in THREE dimensions, in DISCRETE 
          UNITS, and in two RECIPROCAL forms, SPACE and TIME." 
    
                "The physical universe conforms to the relations of 
          ORDINARY COMMUTATIVE mathematics, its magnitudes are 
          ABSOLUTE, and its geometry is EUCLIDEAN." 
    
               From these two postulates, Larson developed a COMPLETE 
          Theoretical Universe, using various combinations of 
          translational, vibrational, rotational, and vibrational-
          rotational MOTIONS, the concepts of IN-ward and OUT-ward 
          SCALAR MOTIONS, and speeds in relation to the Speed of Light 
          . 
      
               At each step in the development, Larson was able to 
          MATCH objects in his Theoretical Universe with objects in the 
          REAL physical universe, , even objects NOT YET 
          DISCOVERED THEN . 
          
               And applying his Theory to his NEW model of the atom, 
          Larson was able to precisely and accurately CALCULATE inter-
          atomic distances in crystals and molecules, compressibility 
          and thermal expansion of solids, and other properties of 
          matter. 

               All of this is described in good detail, with-OUT fancy 
          complex mathematics, in his books. 
    


          BOOKS of Dewey B. Larson
          
               The following is a complete list of the late Physicist 
          Dewey B. Larson's books about his comprehensive GENERAL 
          UNIFIED Theory of the physical universe.  Some of the early 
          books are out of print now, but still available through 
          inter-library loan. 
    
               "The Structure of the Physical Universe"  
    
               "The Case AGAINST the Nuclear Atom" 
    
               "Beyond Newton"  
    
               "New Light on Space and Time"  
    
               "Quasars and Pulsars"  
    
               "NOTHING BUT MOTION"  
                    [A $9.50 SUBSTITUTE for the $8.3 BILLION "Super 
                                                            Collider".] 
                    [The last four chapters EXPLAIN chemical bonding.]

               "The Neglected Facts of Science"  
     
               "THE UNIVERSE OF MOTION" 
                    [FINAL SOLUTIONS to most ALL astrophysical
                                                            mysteries.] 
      
               "BASIC PROPERTIES OF MATTER" 

               All but the last of these books were published by North 
          Pacific Publishers, P.O. Box 13255, Portland, OR  97213, and 
          should be available via inter-library loan if your local 
          university or public library doesn't have each of them. 

               Several of them, INCLUDING the last one, are available 
          from: The International Society of Unified Science , 
          1680 E. Atkin Ave., Salt Lake City, Utah  84106.  This is the 
          organization that was started to promote Larson's Theory.  
          They have other related publications, including the quarterly 
          journal "RECIPROCITY". 

          

          Physicist Dewey B. Larson's Background
    
               Physicist Dewey B. Larson was a retired Engineer 
          .  He was about 91 years old when he 
          died in May 1989.  He had a Bachelor of Science Degree in 
          Engineering Science from Oregon State University.  He 
          developed his comprehensive GENERAL UNIFIED Theory of the 
          physical universe while trying to develop a way to COMPUTE 
          chemical properties based only on the elements used. 
    
               Larson's lack of a fancy "PH.D." degree might be one 
          reason that orthodox physicists are ignoring him, but it is 
          NOT A VALID REASON.  Sometimes it takes a relative outsider 
          to CLEARLY SEE THE FOREST THROUGH THE TREES.  At the same 
          time, it is clear from his books that he also knew ORTHODOX 
          physics and astronomy as well as ANY physicist or astronomer, 
   

Next steps

You've now created an anomaly detector using embeddings! Try embedding your own textual data, visualizing it, and choosing a bound that lets you detect outliers. You can perform dimensionality reduction to complete the visualization step. Note that t-SNE is good at preserving clusters of similar inputs, but it can take a long time to converge or might get stuck in local minima.

To learn how to use other services in the Gemini API, see the Get started guide.