What are Graph Neural Networks? A Short Overview

Nathan Bailey
8 min read · Jul 27, 2024


Recently, I have been researching quite a few neural network architectures that are new to me. One family that caught my eye was Graph Neural Networks. This blog details the architecture of these networks and will be followed by a second blog detailing the implementation of such a network.

The theory presented in this blog is adapted from the book “Deep Learning: Foundations and Concepts” by Christopher M. Bishop [1].

Introduction

Graph Neural Networks are neural networks that operate on graph data. The network takes in a graph, which is composed of nodes and edges. An embedding vector is defined for each node and is transformed through the layers of the network to create a learned representation.

This can also be extended to transform the edges of the graph and the graph as a whole through the network.

What is a graph?

A graph consists of a set of nodes and a set of edges. An edge from node n to node m is denoted (n, m). If an edge links two nodes, they are called neighbours. Data is also associated with each node n, usually given by a D-dimensional vector xn. We can then group this data for all N nodes into an N x D data matrix.

An example of a graph is shown below:

Example Graph [1]

Here we have nodes A, B, C, D, and E. The neighbours of C are A, B, D and E.

Edges in a graph can be specified by an adjacency matrix A. To do this, we have to place an ordering on the nodes, indexing the N nodes from 1, …, N. The adjacency matrix has dimensions N x N and contains a 1 in entry (n, m) if there is an edge from node n to node m; all other entries are 0.

In the example above, if we index the nodes in the order A, B, C, D, E, this results in the following adjacency matrix:

Adjacency Matrix [1]

Here we can see that the row for C has a 1 in every column except its own, since every other node connects to C.

The adjacency matrix therefore depends on the ordering assigned to the nodes when writing it down, and the number of possible orderings grows factorially with the number of nodes.

We introduce the concept of a permutation matrix P, which is the same size as the adjacency matrix. It contains a single 1 in each row and each column, with 0s everywhere else; a 1 in position (n, m) means that node n will be relabelled as node m after the permutation.

An example permutation matrix is shown below and details the permutation from (A, B, C, D, E) -> (C, E, A, D, B). For example, in the first row of the matrix, we can see that node 1 (A) will be reindexed as node 3 after the permutation.

Permutation Matrix [1]
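Written out from the mapping described above (a 1 in row n, column m means node n is relabelled as node m), this permutation matrix is:

P = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}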

If we re-order the node labelling, the resulting data matrix is given by:
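With X the N x D data matrix whose rows hold the node data vectors, the re-ordered data matrix is obtained by permuting the rows of X:

\tilde{X} = P X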

The adjacency matrix is given by:
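Since both the rows and the columns of A must be re-indexed, the re-ordered adjacency matrix is:

\tilde{A} = P A P^{\mathsf{T}}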

When we present the graph data to our model, we will need to assign an ordering to the nodes. We therefore need to make sure that the network output is invariant to (does not depend on) the node-label ordering. If y(·) is the output of the network, then:
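y(\tilde{X}, \tilde{A}) = y(P X, P A P^{\mathsf{T}}) = y(X, A) \quad \text{for any permutation matrix } P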

This applies to a graph neural network when we are outputting a vector based on the entire graph data. For example, we may want to classify the entire graph. If we re-order the nodes, we want the same output.

We may also want to produce outputs for particular nodes. These outputs should be equivariant with respect to node labelling, i.e. the output preserves the ordering: if we re-order the node labels, the per-node predictions should show the same re-ordering. Hence:
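y(P X, P A P^{\mathsf{T}}) = P \, y(X, A)

That is, permuting the input nodes permutes the per-node outputs in exactly the same way.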

Network Architecture

We use convolutional filters as an inspiration for graph neural networks. A 3 x 3 convolutional filter applied at pixel i can be viewed as a graph, where pixel i is a node and the surrounding 8 pixels in layer l are its neighbours N(i).

Convolutional Filter as an Inspiration for Graph Neural Networks [1]

In this case, the output of the layer is given by the following equation, where we multiply the values from layer l by their corresponding weights and sum them to produce the result.

Convolutional Filter Equation [1]
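Concretely, writing h_j^{(l)} for the activation of pixel j in layer l, w_j for the weight attached to filter position j, b for a bias, and f(\cdot) for the non-linear activation function, the layer computes something of the form:

h_i^{(l+1)} = f\left( \sum_{j \in \mathcal{N}(i) \cup \{i\}} w_j \, h_j^{(l)} + b \right)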

However, this is not equivariant under re-ordering of the nodes in layer l: if we changed the order of the neighbouring nodes, each node would be multiplied by a different weight, since each neighbouring node has its own weight value.

We modify the equation to create a layer whose output is equivariant under re-ordering of the input nodes. The following computation is performed:

Modified Equation [1]
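In symbols, with f(\cdot) again the non-linear activation function:

h_i^{(l+1)} = f\left( W_{\text{self}} \, h_i^{(l)} + W_{\text{neigh}} \sum_{j \in \mathcal{N}(i)} h_j^{(l)} + b \right)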

Here every node uses the same weight matrix Wself for its own embedding, a single weight matrix Wneigh is shared across all neighbours, and the bias is also shared across all nodes. If we enforce these three sharing rules, then the calculation is equivariant under node re-ordering.

We can view this equation as message passing, where neighbouring nodes pass messages into node i as illustrated below:

Neighbours pass their “messages” to nodes [1]

The messages are the activations of the neighbouring nodes; we combine them with information from node i and then pass the result through a non-linear activation function.

To formalize this, for each node n in the graph and for each layer l, we introduce a D-dimensional column vector hn(l), which is our node embedding. Each layer takes the node embeddings from the previous layer and processes them as per the equation above, providing the new node embeddings for the current layer.

We initialize the node embedding hn(0) as the node data xn.

Aggregation and Update

We can view the computation as two operations: aggregation and update. Aggregation collects information from the neighbours of node n, resulting in the vector zn(l). The update step then combines this aggregated vector with the current node's embedding to produce the new embedding.

Aggregation can take many forms, the simplest being a summation over the neighbours:

Aggregation Function [1]
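In symbols, summing the embeddings of the neighbours of node n:

z_n^{(l)} = \sum_{m \in \mathcal{N}(n)} h_m^{(l)}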

We could use something more complex that introduces learnable parameters. For example, a multi-layer perceptron could be applied to the node embeddings. To ensure equivariance, the perceptron parameters must be shared across all nodes in layer l, i.e. every node uses the same networks.

Aggregation Function using MLPs [1]
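One permutation-invariant way to do this (a “deep sets”-style construction, where MLP_\theta and MLP_\phi denote multi-layer perceptrons whose parameters are shared by every node) is:

z_n^{(l)} = \text{MLP}_{\phi}\left( \sum_{m \in \mathcal{N}(n)} \text{MLP}_{\theta}\!\left( h_m^{(l)} \right) \right)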

The general formula for update is given by:

Generic Update Rule [1]
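h_n^{(l+1)} = \text{Update}\left( h_n^{(l)}, \, z_n^{(l)} \right), \qquad z_n^{(l)} = \text{Aggregate}\left( \{ h_m^{(l)} : m \in \mathcal{N}(n) \} \right)

where Aggregate and Update are shared (and possibly learnable) functions.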

If we choose sum aggregation and set Wneigh = Wself, the update is given by:

Update Rule using Sum Aggregation [1]
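With a single shared matrix W = Wself = Wneigh, the node and its neighbours are treated uniformly:

h_n^{(l+1)} = f\left( W \left( h_n^{(l)} + \sum_{m \in \mathcal{N}(n)} h_m^{(l)} \right) + b \right) = f\left( W \sum_{m \in \mathcal{N}(n) \cup \{n\}} h_m^{(l)} + b \right)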

In simple terms, a graph neural network can be viewed as a series of layers, in which each layer transforms a set of node-embedding vectors into a new set of the same size and dimensionality.

The node embeddings can be grouped into a matrix H. The computations in the network are then given by the following set of equations:

Complete Equations for a GNN [1]
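For the simple sum-aggregation layer above, one consistent matrix form (a sketch: the rows of H^{(l)} hold the node embeddings, A is the adjacency matrix, and the bias b is broadcast across rows via the all-ones vector \mathbf{1}) is:

H^{(0)} = X

H^{(l+1)} = f\left( H^{(l)} W_{\text{self}}^{\mathsf{T}} + A \, H^{(l)} W_{\text{neigh}}^{\mathsf{T}} + \mathbf{1} b^{\mathsf{T}} \right), \qquad l = 0, \dots, L-1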

Node Classification

Given a graph neural network, we can perform several tasks. The simplest is node classification.

To achieve this, we can define an output layer that calculates a softmax over C classes for each node, as given by the formula below:

Node Classification Output [1]
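One standard form, with a weight matrix W whose rows w_1, …, w_C are shared by every node and h_n^{(L)} the final embedding of node n, is:

p(\text{node } n \text{ in class } i) = \frac{\exp\!\left( w_i^{\mathsf{T}} h_n^{(L)} \right)}{\sum_{j=1}^{C} \exp\!\left( w_j^{\mathsf{T}} h_n^{(L)} \right)}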

Here we are calculating the probability of node n belonging to class i. Note that the weights are again shared: for any other node m, the same weights are used to calculate the probability of it belonging to class i.
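To make the pieces above concrete, here is a minimal NumPy sketch of a single shared-weight message-passing layer followed by a per-node softmax readout. The function and variable names are my own, and the adjacency matrix only includes the edges explicitly described in the example above (the next blog will cover a proper Keras implementation).

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh, b):
    """One message-passing layer: h_n' = ReLU(W_self h_n + W_neigh * sum of neighbour embeddings + b)."""
    Z = A @ H  # aggregation: row n of Z is the sum of node n's neighbour embeddings
    return np.maximum(0.0, H @ W_self.T + Z @ W_neigh.T + b)  # shared weights keep the layer equivariant

def node_class_probs(H, W_out):
    """Per-node class probabilities from a softmax readout whose weights are shared across nodes."""
    logits = H @ W_out.T  # shape (N, C)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy graph on nodes (A, B, C, D, E) with C connected to every other node, as in the example above.
A_adj = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 1],
                  [0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0]], dtype=float)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # N = 5 nodes, D = 4 features per node
H1 = gnn_layer(X, A_adj, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), np.zeros(4))
probs = node_class_probs(H1, rng.normal(size=(3, 4)))  # C = 3 classes
print(probs.shape)  # (5, 3): one class distribution per node
```

Permuting the rows and columns of A_adj together with the rows of X simply permutes the rows of probs, which is the equivariance property discussed earlier.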

Extensions to Graph Neural Networks

Edge Embeddings

We have described above the general network structure when we have data associated with nodes and we produce node embeddings. It might be that we have data that is associated with the edges as well as the nodes. We can introduce edge embeddings and use the following formulae to update the node and edge embeddings.

Using Edge Embeddings [1]
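One representative pair of update equations, with f_e and f_h denoting shared learnable functions (for example, small neural networks), is:

E_{nm}^{(l+1)} = f_e\left( E_{nm}^{(l)}, \, h_n^{(l)}, \, h_m^{(l)} \right)

h_n^{(l+1)} = f_h\left( h_n^{(l)}, \, \sum_{m \in \mathcal{N}(n)} E_{nm}^{(l+1)} \right)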

The final edge embeddings Enm(L) can then be used to make predictions associated with the edges.

Graph Embeddings

In addition to node and edge embeddings, we can also maintain and update an embedding vector g(l) that relates to the graph as a whole. The general set of update equations is then shown below:

Using Graph and Edge Embeddings [1]
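One representative set of updates, with f_e, f_h and f_g denoting shared learnable functions, is:

E_{nm}^{(l+1)} = f_e\left( E_{nm}^{(l)}, \, h_n^{(l)}, \, h_m^{(l)}, \, g^{(l)} \right)

h_n^{(l+1)} = f_h\left( h_n^{(l)}, \, \sum_{m \in \mathcal{N}(n)} E_{nm}^{(l+1)}, \, g^{(l)} \right)

g^{(l+1)} = f_g\left( g^{(l)}, \, \sum_{n} h_n^{(l+1)}, \, \sum_{n, m} E_{nm}^{(l+1)} \right)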

We first update the edge embeddings, which are then aggregated over each node's neighbours. These aggregated vectors contribute to the update of the node embedding vectors h. Finally, the graph embedding vector is updated based on information from all the nodes and edges in the graph, along with the graph embedding from the previous layer.

Conclusions

This blog has detailed the theory behind graph neural networks; in my next blog, we will begin to look at how these networks can be implemented in Keras.

  1. Christopher M. Bishop, Deep Learning: Foundations and Concepts, Springer, 2024.
