The last hidden layer outputs 2 scalar latent features per node. These 2 scalar values represent a node. The nice part is, in this example, we can compute the latent representation based on the social graph alone even without training. In fact, we run the model for one iteration with W⁰ and W¹ initialized randomly with a normal distribution. Let’s not even perform any training on W in this example. i.e. we are not going to perform gradient descent to update W. For X, we simply use an identity matrix as there is no node feature.
In the end, we plot out each node using the output latent features as the x and y coordinates.
As shown, when we color each node with its ground truth (blue for Mr. Hi club and red for the administrator), the latent features generated are highly correlated with which club a person belongs to. In short, we just apply clustering on the input social graph. Intuitively, our example produces latent features similar to their neighbors since the hidden layer is averaging the latent features with its neighbors before applying the activation function. The algorithm manages to cluster neighbors together as the output for each node in a hidden layer is closely related to their neighbors and the associated weights in the graph.
Let’s consider a semi-supervised problem for classification or clustering in which only one data point is labeled for each class or cluster. Once the latent features for each node are calculated, we can compute the distance between non-labeled data and the labeled data. Then, we find the nearest neighbor to a known labeled data to classify or cluster the unlabeled data.
All propagation rules discussed so far are differentiable. For semi-supervised or unsupervised problems, we can use backpropagation to train the weights using the labeled data if needed. (For our last example, even without the extra training, the result is good.) As a demonstration, the GCN model below is trained for 300 iterations. In this model, only a single label per class is provided. As the training progress, we see how nodes belonging to the same ground truth classes start clustering together.
Engineering problems can be solved in the spatial domain or the spectral domain (a.k.a. frequency domain). For example, in signal processing, we apply a Fourier transform to convert an audio input from the temporal domain to the frequency domain and apply a low-pass filter to remove the high-frequency noise. In convolution, these filters can be defined with a spatial or a spectral approach. Which domain to use for the specific step depends on how easy to design the filter and how easy to perform the operation.
In deep learning, researchers also approach the convolution from either the spatial or spectral-domain. For example, in the previous section, we apply a spatial graph convolution in the form of σ(D̂⁻¹ÂHⁱWⁱ).
But GCN is actually a spectral graph convolution. It is a localized first-order approximation of spectral graph convolutions with the propagation rule below.
GCN — A Spectral Graph Convolution (optional)
However, it will need some mathematical concepts, equations, and a few papers to explain it. And because of its complexity, we will show the high-level steps only. But it is not very important to understand it thoroughly.
Convolutional filters are translation-invariant to allow weight sharing (the same filter slides over an input). When working in the spatial domain, it recognizes identical features independently of their spatial locations. A graph does not have a clear spatial concept or a mathematical definition of spatial translation. On the other hand, the spectral graph convolution is based on spectral graph theory which provides the framework to design filters on graphs. And it provides the mathematical background to design operators (filters) with the translation-invariant property.
At a high level, a spectral graph convolution is defined in the Fourier domain by applying the filter gθ on the input signal x.
Previously, we model a graph with the adjacency matrix A and normalize it with the diagonal node degree matrix D. Under the spectral graph theory, an undirected graph is represented by the normalized graph Laplacian matrix L which is defined as
L is a real symmetric positive semidefinite matrix. We will skip the proof but for an N×N matrix, if it is positive semidefinite, it has N orthogonal eigenvectors. In short, these N independent vectors form a basis that can diagonalize a matrix, i.e., L can be factored as L = UΛUᵀ like the one below where U contains the eigenvectors of L and Λ is a diagonal matrix containing the corresponding eigenvalues. These eigenvectors are also known as the graph Fourier modes, and the corresponded eigenvalues are the frequencies of the graph. When a matrix is written in this form, many operations, like Sᵏ, can be computed fast and easily.
The Laplacian is diagonalized by the Fourier basis U and the spectral convolution will be defined as a linear operator with U. In specific, the graph Fourier transform on x is defined as
Since there is no meaningful translation operator in the spatial domain, the graph convolution operator is defined in the Fourier domain. It is the multiplication of the graph x with a filter in the Fourier space. i.e.
x *G g means performing spectral convolution on x with filter g.
If we denote a filter as
Then the graph convolution will be simplified as:
Spectral Convolutional Neural Network assumes the filter gθ is a diagonal matrix filled with learnable parameters Θᵏᵢⱼ. Θᵏᵢⱼ contains parameters for layer k that maps the ith input feature to the jth output feature. For layer k, the output of a hidden layer output is
The shortfall of Spectral Graph Convolution
The appeal of Spectral Graph Convolution is that the spectral graph theory has operators with well-defined translation properties. However, computing the eigenvectors of L is computation expensive, we want to avoid that. Another shortfall is that these filters are not localized in general. Its complexity grows with the size of the graph and not scalable. Indeed, to compute the output of a node, we should hop over a limited number of neighbors, say at most K hops from the central node. This leads to researches of applying approximation to the spectral graph convolution, like those in ChebNet and GCN, such that filters are localized. And because of this issue, spatial graph convolution also receives more attention in recent years.
So how can we find a localized filter in the spectral graph convolution? Chebyshev Spectral CNN (ChebNet) approximates the filter gθ using a truncated expansion in terms of Chebyshev polynomials Tk(x) up to Kth order.
The convolution of a graph x with the defined filter gθ becomes:
This is K-localized as it is a Kth-order polynomial in the Laplacian. It depends only on nodes that are at maximum K steps away from the central node (details). It means it is localized in space and scales well regardless of the graph size. Even ChebNet and GCN start as a Spectral Graph Convolution, they apply approximation such that their filters are localized.
As Tᵢ(L̂) can be written as:
The convolution becomes
Hence, we don’t need to calculate the eigenvectors and it is localized.
GCN introduces a first-order approximation of ChebNet by assuming K = 1 and λmax = 2. The equation becomes
To reduce the number of free parameters and to avoid over-fitting, GCN assumes θ = θ₀= −θ₁, and the equation becomes
But the L.H.S. term below has eigenvalues in the range [0, 2], to avoid exploding/vanishing problems, a renormalization trick is applied that results in the R.H.S. term below.
Finally, the propagation rule for GCN is
Different spatial graph convolutions depend on different aggregators to gather information from each node’s neighbors. Conceptually, we can also view it as message passing.
Without the approximations in Spectral Graph Convolutions, Spatial Graph Convolutions are usually more scalable since their filters are localized. The major challenge is defining a local invariance of CNNs that work with central nodes that have different numbers of neighbors.
Message Passing Neural Network (MPNN)
MPNN outlines a general message passing framework for Spatial Graph Convolution. It passes information (message) from one node to another along edges and repeats it K-step to let information propagate through the graph. The equation below is the hidden feature representation for the kth layer for node v. It depends on the hidden feature of v and its neighbors in the previous layer as well as the edge features with its neighbors. Different choices of U and M functions will lead to different variants of the model.
The latent representation of a node in the last hidden layer can be passed to an output layer to perform node-level predictions. Or it can pass to a readout function R with learnable parameters to perform graph-level predictions.
For example, in drug discovery, a graph represents a chemical compound with atoms as nodes and chemical bonds as edges. We may want to classify whether this compound can hinder the growth of cancer cells or it is carcinogenic.
Diffusion Convolutional Neural Network (DCNN)
DCNN treats graph convolutions as a diffusion process. It transfers information from one node to one of its neighbors with a probability transition matrix P that equals D⁻¹A. This process will repeat K times so that information distribution will reach an equilibrium. And the hidden representation Hᵏ at the kth step will be calculated from Pᵏ, but not on the previous hidden representation.
DCNN concatenates H⁽¹⁾, H⁽²⁾, · · ·, H⁽ᴷ⁾ together to form the final model outputs.
Graph Isomorphism Network (GIN)
However, GIN claims that the graph embeddings in MPNN methods fail to distinguish different graph structures. To fix that, GIN introduces a learnable parameter εᵏ to adjust the weight of the central node.
Social influencers have many connections. For a graph with nodes that have many edges, convolutions may not scale well. GraphSage uses sampling to obtain a fixed number of neighbors for each node for message passing instead of all neighboring nodes.
Here is the formularization:
Besides using a mean aggregator for f, it can use an LSTM aggregator or a pooling aggregator with the equations below:
FastGCN refines the sampling algorithm further. Instead of sampling neighbors for each node, FastGCN uses importance sampling. It samples nodes (u) from a distribution q that reflects how well it is connected to other nodes. Then, it applies importance sampling to estimate the loss gradient of node v from u.
Graph Attention Network (GAT)
GAT uses attention to learn the relative weights between two connected nodes in message passing. First, GAT calculates the attention for the node pair i and j. The resulting vectors Whᵢ (hᵢ is the hidden representation of node i) and Whⱼ are concatenated (|| operator) and multiply with learnable vector a. Their attention, measuring the connective strength between them, is then computed with a softmax function.
The hidden representation for node i is then weighted by the computed attention α with the final result as
Mixture Model Network (MoNet)
MoNet uses node pseudo-coordinates to determine the relative position between a node and its neighbor. Consider node y is a neighbor of node x. First, a d-dimensional vector of pseudo-coordinates u(x, y) is calculated as below. The MoNet paper uses the degrees of the nodes as the pseudo-coordinates and then transforms them further with a fully-connected network.
Once the relative position between two nodes is calculated, a weight function maps the relative position to the relative weight between these two nodes. This weighting function w is parametrized by learnable parameters Θ and composed of J components:
In MoNet, the weight function is defined as:
where μⱼ and Σⱼ are trainable parameters for a Gaussian.
Let’s summarize it. The coordinate system (x, y) for an image is defined in Euclidean space with the output pixel value being f(x, y). With MoNet, the weight function above allows us to expand the concept to non-Euclidean space and apply a shared filter that works across different locations. Below is the detailed equation for applying the convolution filter g on f.
MoNet can be generalized with different choices of u and weight functions. The diagram below shows how different GNN can be viewed as MoNet using the specific u and w.
Large-Scale Learnable Graph Convolutional Networks (LGCN)
In each LGCL layer, it selects and sorts the largest k values for each input feature channel for the neighbors of the central node. With the features of the central node, it forms a (k+1) × 3 matrix. Then a CNN filter is applied. In the diagram below, two filters are applied to create a new node representation with 5 output channels.
LGCN applies multiple layers of LGCL as in the diagram below. The input nodes are transformed into low-dimensional representations using a graph embedding layer. In some cases, it can simply use a linear transformation like the one below.
Then it follows with two LGCL layers with skip concatenation connections to produce the output, i.e. the inputs and outputs of LGCL are concatenated together. Finally, a fully-connected layer is used to classify each node.
In the diagram below, we apply layers of convolutions and ReLU to generate the latent representation for each node.
But in other applications, we are interested in generating a representation of the graph. For example, consider a sample (G, y) where G is a graph, and y is its class. As a classification problem, we read G and predict y — the representation of the graph. In this context, pooling can be applied to generate a coarser graph which can be further reduced with a readout operation. This is similar to the conventional CNN where we apply max pool to reduce the spatial resolution.
In general, the pooling and the readout are similar and mathematically, can be generalized as:
If there is a consistent way to map nodes in a graph to a Euclidean space, there will be no ambiguity on which weight should be associated with a specific node when a filter is applied. So for a spatial graph convolution, the challenge is how to assign weights to nodes consistently.
The two graphs below are similar in topology with the exception of the two red edges below. We color nodes in both graphs with similar colors if they have similar network topology. For example, both nodes D and 2 are colored as orange.
And when we apply a convolution filter, both graphs should generate similar latent features for the corresponding nodes. One way to do it is to sort nodes by the assigned color (say, by RGB values) and assign the corresponding weights accordingly.
DGCNN addresses the same problem on how to sequentially read a graph and associate weights in a consistent order. So it introduces the SortPooling layer which sorts graph nodes by color so that neural networks can be trained on these ordered nodes.
By presenting graphs in a consistent way according to their topology, the 1-D convolution and dense layers will be much easier to train. For example, the graphs below are isomorphic and will be presented in the same way to the NN after the SortPooling layer.
So how can we color nodes in DGCNN? It uses the Weisfeiler-Lehman algorithm.
The concept is very similar to our first clustering example for Zachary’s karate club. But we will not get into the details here. Please refer to the DGCNN paper. Basically, it learns the network embedding (signature) of the nodes based on the network topology. It concatenates a node’s color with its immediate neighbors and then sorts it lexicographically to assign new colors. Nodes with the same signature will be assigned with the same color. The process is repeated until the colors converge or after h iterations. In the end, nodes with the same color share the same structural role within the graph. Then the colors in the last convolution layer will be used to sort the nodes. Therefore, a consistent ordering is imposed for graph nodes. After sorting, DGCNN truncates or extends the array of nodes. This supports graphs with different numbers of nodes.
DiffPool is based on a hierarchical concept in generating a coarser graph. It learns a soft cluster assignment matrix Sˡ with element i, j contains the probability for the node i in layer l to be clustered and assigned to node j in the next hierarchical layer in a coarser graph.
Then, this soft assignment can be used to calculate the new node embedding and adjacency matrix for the next hierarchical layer.
With Sˡ generated by
This model is differentiable and the objective is to learn both GNN models that generate Z and S.
I do not proofread this article as much as I usually do. And many concepts here are pretty complex and hard to understand. So if I make any mistakes, please feel free to leave a comment.