Newsgroup Clustering


Dataset Preparation

Configure the environment variables required for the Azure OpenAI service to generate embeddings. Ensure the following environment variables are set before executing the code:

  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_ENDPOINT
  • AZURE_OPENAI_DEPLOYMENT_NAME - this should name a deployment of the text-embedding-ada-002 model
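
Before running anything that calls the embedding service, it can help to confirm that the three variables listed above are present. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and is not part of the notebook.

```python
import os

# The three variables named in the list above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_env_vars(env=os.environ):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means all three variables are set.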
[1]

First, we define some helper functions for generating the text embeddings:

[2]

Get the newsgroup data for five categories:

  • rec.sport.baseball
  • rec.sport.hockey
  • comp.sys.ibm.pc.hardware
  • talk.religion.misc
  • sci.med
[3]
'From: glang@slee01.srl.ford.com (Gordon Lang)\nSubject: Re: IP numbers on Ethernet Cards\nOrganization: Ford Motor Company Research Laboratory\nLines: 30\nNNTP-Posting-Host: slee01.srl.ford.com\nX-Newsreader: Tin 1.1 PL5\n\nTigger (djohnson@moose.uvm.edu) wrote:\n: Hi!\n: \t\n: Is it possible through either pin configuration or through software\n: programming to change the IP numbers on an ethernet card?\n: \t\n: Thanks in Advance!\n: \n: -- \n: =-Dave   *Tigger!*\n: \n: djohnson@moose.uvm.edu        \'Tiggers are wonderful things!\'\n: Dave C Johnson\n\nI think you mean the ethernet numbers.  The 8 byte ethernet id is the unique\nElectronic Serial Number (ESN) assigned to each ethernet board in existence.\nThis is a "physical layer" concept.  The IP address is a higher layer protocol.\nThe analogy to telephone service is the IP address is your phone number, while\nthe particular wire pair in the cable on the pole has some (unknown to you or\nI) physical identification scheme (number).\n\nBut to answer your question (assuming you indeed meant the Ethernet number)\nit is not supposed to be possible to change the number.  Of course the\nmanufacturer can always retro-fit a board, but there could hardly be a\nreason to ever do that.\n\nIf your question is actually referring to the IP address, it is most definetly\nchangable.  But it is strictly software.\n\nGordon Lang\n'
[25]
{0: 'comp.sys.ibm.pc.hardware',
 1: 'rec.sport.baseball',
 2: 'rec.sport.hockey',
 3: 'sci.med',
 4: 'talk.religion.misc'}

Vectorize the data

Get the text-embedding-ada-002 vectors for this dataset. For the purposes of this demo, each document is split into chunks in a simple way, and the document vector is the element-wise maximum of its chunk vectors.
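
The chunk-and-pool step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's helper code: fixed-size character chunks stand in for the "simple" chunking, and `embed_chunk` is a hypothetical callable that would wrap the Azure OpenAI embedding request and return a list of floats.

```python
def chunk_text(text, chunk_size=1000):
    """Split text into consecutive pieces of at most `chunk_size` characters.

    Fixed-size character chunks are an assumption; the notebook's chunking
    rule is not shown.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks or [""]  # keep one (empty) chunk for empty input


def max_pool(chunk_vectors):
    """Element-wise maximum across equally sized chunk vectors."""
    return [max(values) for values in zip(*chunk_vectors)]


def embed_document(text, embed_chunk, chunk_size=1000):
    """Embed every chunk with `embed_chunk` (hypothetical) and max-pool."""
    return max_pool([embed_chunk(c) for c in chunk_text(text, chunk_size)])
```

Max pooling keeps the document vector the same length as a single chunk vector, so documents of any length end up in the same vector space.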

[20]
2025-01-23 10:41:29.603 | INFO     | __main__:<module>:4 - Creating embeddings for newsgroup data. The length of the vectors is 4593
100%|██████████| 4593/4593 [18:49<00:00,  4.07it/s]
2025-01-23 11:00:19.268 | INFO     | __main__:<module>:18 - Embeddings created for newsgroup data.
2025-01-23 11:00:19.269 | INFO     | __main__:<module>:19 - Storing dataset to data/openai_vectorized_dataset.json
[21]
dict_keys(['news_body', 'target', 'openai_vector'])
[22]
comp.sys.ibm.pc.hardware
[23]
From: glang@slee01.srl.ford.com (Gordon Lang)
Subject: Re: IP numbers on Ethernet Cards
Organization: Ford Motor Company Research Laboratory
Lines: 30
NNTP-Posting-Host: slee01.srl.ford.com
X-Newsreader: Tin 1.1 PL5

Tigger (djohnson@moose.uvm.edu) wrote:
: Hi!
: 	
: Is it possible through either pin configuration or through software
: programming to change the IP numbers on an ethernet card?
: 	
: Thanks in Advance!
: 
: -- 
: =-Dave   *Tigger!*
: 
: djohnson@moose.uvm.edu        'Tiggers are wonderful things!'
: Dave C Johnson

I think you mean the ethernet numbers.  The 8 byte ethernet id is the unique
Electronic Serial Number (ESN) assigned to each ethernet board in existence.
This is a "physical layer" concept.  The IP address is a higher layer protocol.
The analogy to telephone service is the IP address is your phone number, while
the particular wire pair in the cable on the pole has some (unknown to you or
I) physical identification scheme (number).

But to answer your question (assuming you indeed meant the Ethernet number)
it is not supposed to be possible to change the number.  Of course the
manufacturer can always retro-fit a board, but there could hardly be a
reason to ever do that.

If your question is actually referring to the IP address, it is most definetly
changable.  But it is strictly software.

Gordon Lang

[26]
[-0.003231409704312682, -0.001362027251161635, -0.008819126524031162, -0.03726506605744362, -0.0034561441279947758]

K-means clustering

Compute cluster centers for a dataset using KMeans and save them to a JSON file.

Cluster centers will be stored in data/openai_cluster_centers.json.
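
The training cell itself is not reproduced here. As a rough illustration of what computing and saving the centers involves, the sketch below runs a dependency-free Lloyd's iteration and writes the result as JSON. It is a stand-in, not the notebook's code, which presumably relies on a library KMeans implementation; the function names and the random initialization are assumptions.

```python
import json
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: put every vector in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def save_centers(centers, file_path):
    """Write the centers as a JSON list of lists."""
    with open(file_path, "w") as f:
        json.dump(centers, f)
```

With the document vectors from the previous step, `save_centers(kmeans(vectors, k=5), "data/openai_cluster_centers.json")` would produce a file in the location named above, with five centers matching the five newsgroups.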

[28]
2025-01-23 11:01:14.091 | INFO     | __main__:<module>:15 - Started training Kmeans model...
2025-01-23 11:01:14.223 | INFO     | __main__:<module>:23 - Kmeans cluster centers saved at file_path: data/openai_cluster_centers.json

Store the data in Elasticsearch

[36]
[ ]
2025-01-23 11:06:57.607 | INFO     | __main__:ingest_data_to_es:50 - Ingested 4593 documents into index 'newsgroups_openai_dataset'
[40]

Add clustering ingest pipeline

[52]
ObjectApiResponse({'acknowledged': True})

When simulating the ingest pipeline, we can see that a cluster number is assigned to a test document.
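
Conceptually, the pipeline attaches two fields to each document: the index of the nearest cluster center and the distance to it. The sketch below mirrors that computation in plain Python. Euclidean distance is an assumption, since the pipeline's script and metric are not shown, and `assign_cluster` is a hypothetical name.

```python
import math


def assign_cluster(vector, centers):
    """Return the nearest center's index and its distance for one vector."""
    distances = [math.dist(vector, center) for center in centers]
    closest = min(range(len(centers)), key=distances.__getitem__)
    return {
        "ml_clustering.closestCluster": closest,
        "ml_clustering.minDistance": distances[closest],
    }
```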

[63]
dict_keys(['news_body', 'openai_vector', 'target', 'ml_clustering.closestCluster', 'ml_clustering.minDistance'])
2

Finally, we reindex the dataset through the new pipeline, which assigns ml_clustering.closestCluster to every document:
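
A reindex request that routes documents through an ingest pipeline names the pipeline in its destination. The sketch below only builds the request arguments. The source index name is taken from the ingest log above, while the destination index and pipeline names are hypothetical placeholders.

```python
def build_reindex_request(source_index, dest_index, pipeline):
    """Arguments for a reindex that applies `pipeline` to every document."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "pipeline": pipeline},
        # Run as a background task so the call returns a task id immediately.
        "wait_for_completion": False,
    }


request = build_reindex_request(
    "newsgroups_openai_dataset",            # from the ingest log above
    "newsgroups_openai_dataset_clustered",  # hypothetical destination index
    "ml_clustering_pipeline",               # hypothetical pipeline name
)
# With an Elasticsearch Python client named es_client, this would be sent as:
#   task = es_client.reindex(**request)
# and polled with:
#   es_client.tasks.get(task_id=task["task"])["completed"]
```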

[84]

After a few moments, the task completes:

[95]
task completed:  True
/var/folders/ls/ql10y6711bb9z117p2_ycjwr0000gn/T/ipykernel_29384/1858691271.py:1: GeneralAvailabilityWarning: This API is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
  print("task completed: ", es_client.tasks.get(task_id=task['task'])['completed'])

Visualization

[1]

For the visualizations, we first assign a newsgroup label to each ml_clustering.closestCluster value. To do this, we slice the data by newsgroup name (target), count the cluster numbers assigned within each slice, and map the most frequent cluster number to that newsgroup.
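
The majority-vote mapping described above can be sketched in a few lines. The function name is hypothetical, and the inputs are assumed to be two parallel sequences: the newsgroup name of each document and the cluster number assigned to it.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(targets, clusters):
    """Map each cluster number to the newsgroup in which it is most frequent."""
    counts = defaultdict(Counter)
    for target, cluster in zip(targets, clusters):
        counts[target][cluster] += 1
    # For every newsgroup slice, take its most common cluster number.
    return {
        counter.most_common(1)[0][0]: target
        for target, counter in counts.items()
    }
```

If two newsgroups shared the same dominant cluster, the later one would overwrite the earlier in this sketch, so the result is only a clean one-to-one mapping when the clusters separate the newsgroups well.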

[2]
{4: 'comp.sys.ibm.pc.hardware', 2: 'talk.religion.misc', 1: 'rec.sport.hockey', 3: 'rec.sport.baseball', 0: 'sci.med'}

Before creating the visualizations, let's look at the in-sample accuracy and confusion matrix:
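
Both quantities can be computed directly from the true and predicted labels. The sketch below assumes rows are true labels and columns are predicted labels, a common convention that is not confirmed by the notebook; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of documents on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```

Accuracy is the diagonal sum divided by the total count, so it does not depend on which axis holds the true labels.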

[22]
accuracy 0.9623339865011975
[[973   0   0   4   5]
 [  9 935  22   6  22]
 [  9  21 963   1   5]
 [  6   1   0 931  52]
 [  0   0   0  10 618]]

The last thing we need to do for the visualizations is to compute the t-SNE embeddings for the text-embedding-ada-002 vectors:

[4]

Then we can create visualizations!

[25]