Weights and Biases 8 Node Classification (With W&B)

8 Node Classification (With W&B)

wandb-examplespygcolabs

alph-notebooks/wandb-examples / 8_Node_Classification_(with_W&B).ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

[ ]

Setup and login to Weights & Biases

[ ]

Helper function for visualization.

[ ]

Node Classification with Graph Neural Networks

Previous: Introduction: Hands-on Graph Neural Networks

This tutorial will teach you how to apply Graph Neural Networks (GNNs) to the task of node classification. Here, we are given the ground-truth labels of only a small subset of nodes, and want to infer the labels for all the remaining nodes (transductive learning).

To demonstrate, we make use of the Cora dataset, which is a citation network where nodes represent documents. Each node is described by a 1433-dimensional bag-of-words feature vector. Two documents are connected if there exists a citation link between them. The task is to infer the category of each document (7 in total).

This dataset was first introduced by Yang et al. (2016) as one of the datasets of the Planetoid benchmark suite. We again can make use PyTorch Geometric for an easy access to this dataset via torch_geometric.datasets.Planetoid:

[ ]

Overall, this dataset is quite similar to the previously used KarateClub network. We can see that the Cora network holds 2,708 nodes and 10,556 edges, resulting in an average node degree of 3.9. For training this dataset, we are given the ground-truth categories of 140 nodes (20 for each class). This results in a training node label rate of only 5%.

In contrast to KarateClub, this graph holds the additional attributes val_mask and test_mask, which denotes which nodes should be used for validation and testing. Furthermore, we make use of data transformations via transform=NormalizeFeatures(). Transforms can be used to modify your input data before inputting them into a neural network, e.g., for normalization or data augmentation. Here, we row-normalize the bag-of-words input feature vectors.

We can further see that this network is undirected, and that there exists no isolated nodes (each document has at least one citation).

Training a Multi-layer Perception Network (MLP)

In theory, we should be able to infer the category of a document solely based on its content, i.e. its bag-of-words feature representation, without taking any relational information into account.

Let's verify that by constructing a simple MLP that solely operates on input node features (using shared weights across all nodes):

[ ]

(optionally) logging the data attributes to W&B summary.

[ ]

Our MLP is defined by two linear layers and enhanced by ReLU non-linearity and dropout. Here, we first reduce the 1433-dimensional feature vector to a low-dimensional embedding (hidden_channels=16), while the second linear layer acts as a classifier that should map each low-dimensional node embedding to one of the 7 classes.

Let's train our simple MLP by following a similar procedure as described in the first part of this tutorial. We again make use of the cross entropy loss and Adam optimizer. This time, we also define a test function to evaluate how well our final model performs on the test node set (which labels have not been observed during training).

We also visualize the embeddings of the untrained model to in visually comparing the progress made by the training process below.

NOTE: For W&B mode, please set up the embedding projector from the setting panel of the logged table. More information can be found here: https://docs.wandb.ai/ref/app/features/panels/weave/embedding-projector

[ ]

After training the model, we can call the test function to see how well our model performs on unseen labels. Here, we are interested in the accuracy of the model, i.e., the ratio of correctly classified nodes:

We also visualize the embeddings of the output. This will give us a visual hint as to how good the model is performing, when compared to the embeddings of the geometric models defined below.

[ ]

As one can see, our MLP performs rather bad with only about 59% test accuracy. But why does the MLP do not perform better? The main reason for that is that this model suffers from heavy overfitting due to only having access to a small amount of training nodes, and therefore generalizes poorly to unseen node representations.

It also fails to incorporate an important bias into the model: Cited papers are very likely related to the category of a document. That is exactly where Graph Neural Networks come into play and can help to boost the performance of our model.

Training a Graph Neural Network (GNN)

We can easily convert our MLP to a GNN by swapping the torch.nn.Linear layers with PyG's GNN operators.

Following-up on the first part of this tutorial, we replace the linear layers by the GCNConv module. To recap, the GCN layer (Kipf et al. (2017)) is defined as

\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \, \cup \, \{ v \}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}

where $\mathbf{W}^{(\ell + 1)}$ denotes a trainable weight matrix of shape [num_output_features, num_input_features] and $c_{w,v}$ refers to a fixed normalization coefficient for each edge. In contrast, a single Linear layer is defined as

\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \mathbf{x}_v^{(\ell)}

which does not make use of neighboring node information.

[ ]

Let's visualize the node embeddings of our untrained GCN network. For visualization, we make use of TSNE to embed our 7-dimensional node embeddings onto a 2D plane.

[ ]

We certainly can do better by training our model. The training and testing procedure is once again the same, but this time we make use of the node features x and the graph connectivity edge_index as input to our GCN model.

[ ]

After training the model, we can check its test accuracy:

[ ]

There it is! By simply swapping the linear layers with GNN layers, we can reach 81.5% of test accuracy! This is in stark contrast to the 59% of test accuracy obtained by our MLP, indicating that relational information plays a crucial role in obtaining better performance.

We can also verify that once again by looking at the output embeddings of our trained model, which now produces a far better clustering of nodes of the same category.

[ ]

Using W&B Sweeps

In this section, we'll look into how we can use W&B Sweeps to perform a hyper-parameter search for the GCN. For this to work, it is essential for wandb to be enabled, i.e., enable_wandb should be set to True.

[ ]

Conclusion

In this chapter, you have seen how to apply GNNs to real-world problems, and, in particular, how they can effectively be used for boosting a model's performance. In the next section, we will look into how GNNs can be used for the task of graph classification.

Next: Graph Classification with Graph Neural Networks

(Optional) Exercises

To achieve better model performance and to avoid overfitting, it is usually a good idea to select the best model based on an additional validation set. The Cora dataset provides a validation node set as data.val_mask, but we haven't used it yet. Can you modify the code to select and test the model with the highest validation performance? This should bring test performance to 82% accuracy.
How does GCN behave when increasing the hidden feature dimensionality or the number of layers? Does increasing the number of layers help at all?
You can try to use different GNN layers to see how model performance changes. What happens if you swap out all GCNConv instances with GATConv layers that make use of attention? Try to write a 2-layer GAT model that makes use of 8 attention heads in the first layer and 1 attention head in the second layer, uses a dropout ratio of 0.6 inside and outside each GATConv call, and uses a hidden_channels dimensions of 8 per head.

[ ]