Graph Classification with 🤗 Transformers

This notebook shows how to fine-tune the Graphormer model for graph classification on a dataset available on the Hub. The idea is to add a randomly initialized classification head on top of a pre-trained encoder, then fine-tune the whole model on a labeled dataset.

Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set the model checkpoint and the batch size accordingly, and the rest of the notebook should run smoothly.

In this notebook, we'll fine-tune from the https://huggingface.co/clefourrier/pcqm4mv2-graphormer-base checkpoint.

Dependencies

Before we start, let's install the datasets and transformers libraries, as well as Cython, on which this model depends.
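In a notebook environment, this can be done with pip; the quiet flag and the version pin (matching the requirement stated below) are one way to write it:

```shell
pip install -q "transformers>4.27.2" datasets Cython
```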


If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed. The Transformers version must be > 4.27.2.

We check that Cython is correctly installed.
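A quick sanity check, assuming Cython is importable under its canonical package name:

```python
import Cython

# Graphormer's data collation relies on compiled helpers,
# so Cython must be importable.
print("Cython version:", Cython.__version__)
```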


If you want to visualize your graphs, you also need to install matplotlib and networkx.
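These are only needed for the visualization cells below:

```shell
pip install -q matplotlib networkx
```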


To be able to share your model with the community and generate results via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up on huggingface.co if you haven't already!), then execute the following cell and input your token:
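One way to do this is with the huggingface_hub notebook helper (it needs ipywidgets and a notebook environment; outside a notebook, `huggingface-cli login` in a terminal does the same):

```python
from huggingface_hub import notebook_login

# Opens a prompt where you can paste a Hugging Face access token.
notebook_login()
```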


Then you need to install Git-LFS to upload your model checkpoints:
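On a Debian-based system (such as Colab), a sketch of the install (package name assumed to be available in your distribution's repositories):

```shell
apt-get install -y git-lfs
git lfs install
```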


Fine-tuning Graphormer on a graph classification task

In this notebook, we will see how to fine-tune the Graphormer model from 🤗 Transformers on a graph classification dataset.

Given a graph, the goal is to predict its class.

Loading the dataset

Loading a graph dataset from the Hub is very easy. Let's load the ogbg-molhiv dataset, stored in the OGB repository. To find other graph datasets, look for the "Graph Machine Learning" tag on the Hub. You'll find social graphs, molecular datasets, some artificial ones, etc.!

This dataset contains a collection of molecules (from MoleculeNet), and the goal is to predict whether or not they inhibit HIV.


Let us also load the Accuracy metric, which we'll use to evaluate our model both during and after training.


The dataset object itself is a DatasetDict, which contains one key per split (in this case, "train", "validation" and "test" splits).
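Assuming the `dataset` loaded above, printing it shows the splits and their sizes:

```python
# One Dataset object per split.
print(dataset)
```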


To access an actual element, you need to select a split first, then give an index:
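For example (assuming the `dataset` from the loading cell above):

```python
# Choose the split, then index into it like a list.
example = dataset["train"][0]
print(example.keys())
```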


Each example consists of a graph (made of its nodes, edges, and optional features) and a corresponding label. We can also verify this by checking the features of the dataset:
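Assuming the `dataset` from the loading cell above:

```python
# Each split exposes its column names and types through `features`.
print(dataset["train"].features)
```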


We can inspect the graph using networkx and pyplot.
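A self-contained sketch: the `edge_index` below is a tiny hypothetical graph in the same two-row format the dataset uses (in the notebook you would pass the `edge_index` of an actual example instead):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the cell also runs outside a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical graph: row 0 holds source nodes, row 1 the matching targets.
edge_index = [[0, 1, 1, 2, 3], [1, 0, 2, 3, 0]]

G = nx.Graph()
G.add_edges_from(zip(edge_index[0], edge_index[1]))
nx.draw(G, with_labels=True)
plt.savefig("graph.png")
```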


Let's print the corresponding label:
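Assuming the `dataset` from the loading cell above, the label lives under the "y" key:

```python
# Binary label: whether the molecule inhibits HIV.
print(dataset["train"][0]["y"])
```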


Preprocessing the data

Graph transformer frameworks usually apply specific preprocessing to their datasets to generate added features and properties which help the underlying learning task (classification in our case).

Here, we use Graphormer's default preprocessing, which generates in/out degree information, shortest-path matrices between nodes, and other properties of interest for the model.
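A sketch using the preprocessing helper shipped with the Graphormer implementation, assuming the `dataset` loaded earlier (mapping the full dataset can take several minutes):

```python
from transformers.models.graphormer.collating_graphormer import preprocess_item

# batched=False because each graph has its own number of nodes and edges.
dataset_processed = dataset.map(preprocess_item, batched=False)
```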


Let's access an element to look at all the features we've added:
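Assuming the `dataset_processed` from the previous cell:

```python
# The preprocessed columns now sit alongside the original ones.
print(dataset_processed["train"][0].keys())
```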


Training the model

Calling the from_pretrained method on our model downloads and caches the weights for us. As the number of classes (for prediction) is dataset-dependent, we pass the new num_classes as well as ignore_mismatched_sizes alongside the model_checkpoint. This makes sure a custom classification head is created, specific to our task, and hence likely different from the original decoder head.

(When using a pretrained model, you must make sure the embeddings of your data have the same shape as the ones used to pretrain your model.)
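A sketch of the model cell; num_classes=2 matches the binary HIV-inhibition label:

```python
from transformers import GraphormerForGraphClassification

model_checkpoint = "clefourrier/pcqm4mv2-graphormer-base"

# ignore_mismatched_sizes lets us swap the pretrained head for a
# freshly initialized 2-class classification head.
model = GraphormerForGraphClassification.from_pretrained(
    model_checkpoint,
    num_classes=2,
    ignore_mismatched_sizes=True,
)
```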


The warning is telling us we are throwing away some weights (the weights and bias of the classifier layer) and randomly initializing some others (the weights and bias of a new classifier layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a Trainer, we will need to define the training configuration and the evaluation metric. The most important is the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model.

For graph datasets, it is particularly important to play around with batch sizes and gradient accumulation steps to train on enough samples while avoiding out-of-memory errors.
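One possible configuration; the folder name and all hyperparameter values here are illustrative, not prescribed by the original notebook:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    "graphormer-molhiv",              # checkpoint folder (hypothetical name)
    per_device_train_batch_size=4,    # kept small: graph batches are memory-hungry
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,   # effective batch size of 64
    num_train_epochs=20,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    push_to_hub=False,
)
```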


In the Trainer for graph classification, it is important to pass the specific data collator for the given graph dataset, which will convert individual graphs to batches for training.
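A sketch of the Trainer setup, assuming the `model`, `training_args`, and `dataset_processed` from the cells above:

```python
from transformers import Trainer
from transformers.models.graphormer.collating_graphormer import GraphormerDataCollator

# GraphormerDataCollator pads variable-size graphs into dense tensors per batch.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_processed["train"],
    eval_dataset=dataset_processed["validation"],
    data_collator=GraphormerDataCollator(),
)
```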


We can now train our model!
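Assuming the `trainer` from the previous cell:

```python
# Runs the fine-tuning loop and returns training statistics.
train_results = trainer.train()
```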


You can now upload the result of the training to the Hub with the following:
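Assuming the `trainer` from the previous cell and a valid Hub login:

```python
# Pushes the final checkpoint and training metadata to your Hub account.
trainer.push_to_hub()
```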
