Anomaly Detection With Embeddings
Copyright 2025 Google LLC.
Overview
This tutorial demonstrates how to use embeddings from the Gemini API to detect potential outliers in your dataset. You will visualize a subset of the 20 Newsgroups dataset using t-SNE and detect outliers that lie outside a chosen radius from the central point of each category's cluster.
To run the following cell, your API key must be stored in a Colab Secret named GOOGLE_API_KEY. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the Authentication quickstart for an example.
Prepare dataset
The 20 Newsgroups Text Dataset from the open-source scikit-learn project contains about 18,000 newsgroup posts on 20 topics, divided into training and test sets. The split between the training and test sets is based on messages posted before and after a specific date. This tutorial uses the training subset.
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Here is the first example in the training set.
Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
Next, sample the data by choosing a few categories and taking 150 data points from each of them in the training dataset. This tutorial uses the science categories.
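The sampling step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: the function name `sample_per_category`, the `"sci."` prefix filter, and the toy rows are all assumptions, and the sample size is kept tiny so the example runs on made-up data.

```python
import random
from collections import defaultdict

SAMPLE_SIZE = 2  # the tutorial draws 150 per category; kept tiny for the toy data


def sample_per_category(rows, prefix, sample_size, seed=0):
    """Keep categories whose name starts with `prefix`, then draw
    `sample_size` texts from each one without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_category = defaultdict(list)
    for text, category in rows:
        if category.startswith(prefix):
            by_category[category].append(text)
    return {
        category: rng.sample(texts, sample_size)
        for category, texts in by_category.items()
    }


# Made-up (post text, newsgroup name) pairs standing in for the real dataset.
rows = [
    ("post a", "sci.space"), ("post b", "sci.space"), ("post c", "sci.space"),
    ("post d", "sci.med"), ("post e", "sci.med"), ("post f", "sci.med"),
    ("post g", "rec.autos"), ("post h", "rec.autos"),
]
sampled = sample_per_category(rows, "sci.", SAMPLE_SIZE)
```

With a dataframe, a per-group sample achieves the same thing in one call; the dictionary version here only makes the grouping explicit.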
Create the embeddings
In this section, you will see how to generate embeddings for the texts in the dataframe using the Gemini API.
API changes to Embeddings with model embedding-001
The embeddings model, text-embedding-004, accepts a task type parameter and an optional title (the title is only valid with task_type=RETRIEVAL_DOCUMENT).
These parameters apply only to the embeddings models. The task types are:
| Task Type | Description |
|---|---|
| RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting. |
| RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting. |
| SEMANTIC_SIMILARITY | Specifies the given text will be used for Semantic Textual Similarity (STS). |
| CLASSIFICATION | Specifies that the embeddings will be used for classification. |
| CLUSTERING | Specifies that the embeddings will be used for clustering. |
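When many texts need embedding, one common pattern is to send them in batches and pass the same task type with every request. The sketch below shows only that pattern. `embed_fn` is a hypothetical stand-in for whatever call actually reaches the API, `fake_embed` returns toy vectors so the example runs offline, and both the batch size of 100 and the choice of `CLUSTERING` as the default task type are assumptions rather than values stated in this tutorial.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str], str], List[Vector]],
    task_type: str = "CLUSTERING",  # assumed default, see the table above
    batch_size: int = 100,          # assumed batch size
) -> List[Vector]:
    """Embed `texts` one batch at a time, passing the same task type each time."""
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch, task_type))
    return vectors


def fake_embed(batch: Sequence[str], task_type: str) -> List[Vector]:
    """Hypothetical offline stand-in for the real API call: toy 2-d vectors."""
    return [[float(len(text)), float(len(task_type))] for text in batch]


texts = [f"document {i}" for i in range(250)]
embeddings = embed_in_batches(texts, fake_embed)
```

Injecting the embedding function keeps the batching logic testable without network access; in the notebook, the stand-in would be replaced by a thin wrapper around the real client.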
Dimensionality reduction
The dimension of the document embedding vector is 768. In order to visualize how the embedded documents are grouped together, you will need to apply dimensionality reduction as you can only visualize the embeddings in 2D or 3D space. Contextually similar documents should be closer together in space as opposed to documents that are not as similar.
768
(600, 768)
You will apply the t-Distributed Stochastic Neighbor Embedding (t-SNE) approach to perform dimensionality reduction. This technique reduces the number of dimensions, while preserving clusters (points that are close together stay close together). For the original data, the model tries to construct a distribution over which other data points are "neighbors" (e.g., they share a similar meaning). It then optimizes an objective function to keep a similar distribution in the visualization.
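For readers who want the objective spelled out, the standard t-SNE formulation is reproduced here. The notation is not from this tutorial and follows the usual presentation of the method: $x_i$ are the original embedding vectors, $y_i$ their low-dimensional counterparts, $N$ the number of points, and $\sigma_i$ a per-point bandwidth set by the perplexity.

```latex
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Here $P$ is the neighbor distribution over the original data, $Q$ is the matching distribution in the visualization, and minimizing $C$ over the $y_i$ is what keeps the two similar.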
Outlier detection
To determine which points are anomalous, you will classify each point as an inlier or an outlier. Start by finding the centroid, the location that represents the center of a cluster, and then use each point's distance from it to decide which points are outliers.
Start by getting the centroid of each category.
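A centroid is simply the coordinate-wise mean of the points in a category. The following is a minimal standard-library sketch on made-up 2-d points; the function name `category_centroids` and the data are illustrative assumptions, and the same code works unchanged for vectors of any dimension.

```python
from collections import defaultdict
from statistics import fmean


def category_centroids(points, labels):
    """Return {label: centroid}, where each centroid is the
    coordinate-wise mean of the points carrying that label."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        # zip(*members) regroups the points into one sequence per axis
        label: tuple(fmean(axis) for axis in zip(*members))
        for label, members in grouped.items()
    }


# Toy points; real inputs would be the embedding vectors or their t-SNE projection.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (10.0, 10.0), (12.0, 14.0)]
labels = ["sci.space", "sci.space", "sci.space", "sci.med", "sci.med"]
centroids = category_centroids(points, labels)
```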
Plot each centroid you have found against the rest of the points.
Choose a radius. Anything beyond this bound from the centroid of that category is considered an outlier.
Depending on how sensitive you want your anomaly detector to be, you can choose which radius you would like to use. For now, 0.62 is used, but you can change this value.
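The radius rule can be sketched as follows: compute each category's centroid, then flag every point whose Euclidean distance to its own centroid exceeds the radius. This is an illustrative standard-library version on made-up points, not the notebook's implementation; `find_outliers` and the data are assumptions, and only the 0.62 radius comes from the text.

```python
import math
from collections import defaultdict
from statistics import fmean

RADIUS = 0.62  # value used in the text; raise it to flag fewer points


def find_outliers(points, labels, radius):
    """Return the indices of points lying farther than `radius`
    from the centroid of their own category."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    centroids = {
        label: tuple(fmean(axis) for axis in zip(*members))
        for label, members in grouped.items()
    }
    return [
        index
        for index, (point, label) in enumerate(zip(points, labels))
        if math.dist(point, centroids[label]) > radius
    ]


# Toy data: category "a" contains one point far from the rest.
points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (1.7, 0.3),
          (5.0, 5.0), (5.2, 5.0), (5.1, 5.2)]
labels = ["a", "a", "a", "a", "b", "b", "b"]
outlier_indices = find_outliers(points, labels, RADIUS)
```

Note that an extreme point also pulls its own centroid toward itself, so a smaller radius does not always flag more points in a predictable way; it is worth trying a few values and inspecting the results.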
Plot the outliers and denote them using a transparent red color.
Use the index values of the dataframes to print a few examples of what outliers can look like in each category. Here, the first data point from each category is printed out. Explore other points in each category to see data that are deemed outliers, or anomalies.
Re: The [secret] source of that announcement
Organization: DSI/USCRPAC
Lines: 23
suggests using a common but restricted-distribution private
key to allow public key system encrypted postings. In theory that will work
fine as long as the privae key remains secure.
In practice it would be a good idea to check to see if that would be a
violation of some net rule, practice, custom, etc. I don't say it would be,
just that it would be a good idea to check. This is not like rot13 where
everybody can have the key trivially.
It would also be a good idea to check to see if such posts would be
forwarded by the sites needed to make the chain work.
Of course there'd be no problem with a discussion group travelling over
facilities entirely under the control of the members. Probably there would
also be no problem with a mailing list approach. It might even be fun for
some.
--
David Sternlight Great care has been taken to ensure the accuracy of
our information, errors and omissions excepted.
Re: arcade style buttons and joysticks Organization: Antone's Italian Kitchen and Excellence in Operating Network X-Newsreader: rusnews v1.02 Lines: 26 writes: > Hi there, > Can anyone tell me where it is possible to purchase controls found > on most arcade style games. Many projects I am working on would > be greatly augmented if I could implement them. Thanx in advance. > > -Dave > > Contact Chris Arthur at He restores lots of old video and arcade games and knows where to get parts. Tony ----------------------------------------------------------------------- -- Anthony S. Pelliccio, kd1nr/ae // Yes, you read it right, the // -- system @ garlic.sbs.com // man who went from No-Code // -----------------------------------// to Extra in // -- Flame Retardent Sysadmin // exactly one year! // ------------------------------------------------------------------- -- This is a calm .sig! -- --------------------------
Re: Is MSG sensitivity superstition? Organization: your service Lines: 20 NNTP-Posting-Host: hpctdkz.col.hp.com Jason Chen writes: > Now here is a new one: vomiting. My guess is that MSG becomes the number one > suspect of any problem. In this case. it might be just food poisoning. But > if you heard things about MSG, you may think it must be it. ---------- Yeah, it might, if you only read the part you quoted. You somehow left out the part about "we all ate the same thing." Changes things a bit, eh? You complain that people blame MSG automatically, since it's an unknown and therefore must be the cause. It is equally unreasonable to defend it, automatically assuming that it CAN'T be the culprit. Pepper makes me sneeze. If it doesn't affect you the same way, fine. Just don't tell me I'm wrong for saying so. These people aren't condemning Chinese food, Mr. Chen - just one of its ingredients. Try not to take it so personally.
Re: Abyss--breathing fluids Organization: U.C. Berkeley Math. Department. Lines: 19 NNTP-Posting-Host: skippy.berkeley.edu Are breathable liquids possible? I remember seeing an old Nova or The Nature of Things where this idea was touched upon . If nothing else, I know such liquids ARE possible because... They showed a large glass full of this liquid, and put a white mouse in it. Since the liquid was not dense, the mouse would float, so it was held down by tongs clutching its tail. The thing struggled quite a bit, but it was certainly held down long enough so that it was breathing the liquid. It never did slow down in its frantic attempts to swim to the top. Now, this may not have been the most humane of demonstrations, but it certainly shows breathable liquids can be made. -- *Isaac Kuo * ___ * * _____/_o_\_____ * Twinkle, twinkle, little .sig, *(====) * Keep it less than 5 lines big. * \==\/ \/==/
Next steps
You've now created an anomaly detector using embeddings! Try visualizing your own textual data as embeddings, and choose a bound that lets you detect outliers. You can perform dimensionality reduction to complete the visualization step. Note that t-SNE is good at preserving clusters of similar inputs, but it can take a long time to converge or get stuck in local minima. If you run into this issue, another technique you could consider is principal component analysis (PCA).