During my literature review, I stumbled upon an information-theoretic framework for analysing the link prediction problem (Tan et al. (2014), Kumar and Sharma (2020)). For an overview of what link prediction is, read my previous article here. The basic idea is to predict unseen edges in a graph. Such edges could represent citation relationships, friendships, or dependencies in different networks. The framework casts this task as an information theory problem: an edge is more likely to form between a pair of nodes if it has a high probability (low uncertainty, low information) under the chosen model.
In this article, we show how one might use the framework in practice. We first introduce the model, then provide a code implementation for a small graph. This article is heavily derived from Tan et al. (2014), so check out their paper for more details.
Given a graph G, we'd like to estimate the likelihood of observing an edge between a pair of nodes (x, y). The framework defines this likelihood (s_xy) as the negative of the conditional self-information of seeing edge (x, y) given the common neighbours O_xy of x and y. In information theory, the higher the probability, the lower the information (uncertainty is reduced), so the negative sign makes the likelihood s_xy increase with probability.
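In symbols, using the article's own notation, the definition reads:

```latex
s_{xy} = -\,I(L_{xy} \mid O_{xy})
```

where I(· | ·) denotes conditional self-information and L_xy is the event that edge (x, y) exists.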
For brevity, we don't derive the whole equation here (see Tan et al. (2014) for the full derivation). Instead, we state the final equation of the likelihood directly. It is essentially an aggregation of probabilities from the common neighbours of nodes x and y.
There are two main types of probabilities: the prior, p(L_mn) or p(L_xy), and the likelihood, p(L_mn | z). The latter is a conditional probability that takes into account a common neighbour z of nodes x and y. The two probabilities can be chosen flexibly, depending on the researcher's preference. In Tan et al. (2014), however, the following choice of probability models was used.
The prior model is defined as 1 minus the combinatorial probability of seeing no edge formed between m and n at all, out of the total M edges in the graph.
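Written out (to match the prior function implemented below, with k_m and k_n the degrees of m and n, M the total number of edges, and binomial coefficients counting edge placements):

```latex
p(L_{mn}) = 1 - \frac{\binom{M - k_n}{k_m}}{\binom{M}{k_m}}
```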
The likelihood model is defined as the clustering coefficient of common neighbour node z. This coefficient is defined as the ratio of the number of observed triangles over the number of possible triangles (i.e. triads) formed by z and its neighbours.
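In symbols, with N_triangles(z) the number of observed triangles through z and k_z the degree of z, this is:

```latex
p(L_{mn} \mid z) = \frac{N_{\mathrm{triangles}}(z)}{\binom{k_z}{2}}
```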
There is no theoretical proof that these two models are the optimal choice. However, Tan et al. (2014) showed that their performance is often empirically superior to other models.
We can plug the two models back into the original framework, leaving us with this final equation.
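Spelled out (a reconstruction consistent with the code below, since the original equation was presented as an image), with I(·) = -log2 p(·) denoting self-information, Γ(z) the neighbour set of z, and k_z the degree of z:

```latex
s_{xy} = \sum_{z \in O_{xy}} \frac{1}{k_z (k_z - 1)}
         \sum_{\{m, n\} \subseteq \Gamma(z)}
         2 \big( I(L_{mn}) - I(L_{mn} \mid z) \big)
         \;-\; I(L_{xy})
```

The inner sum runs over unordered pairs of neighbours of z; the factor 2 together with 1/(k_z(k_z - 1)) averages the mutual information over ordered pairs.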
We now implement the framework in Python. First, we import the necessary packages.
import itertools
import math

import networkx as nx
import numpy as np
To show how the framework is utilised in practice, we'll use the dummy graph provided by Tan et al. (2014). We use networkx to define the graph object.
G = nx.Graph()
edgeList = [(1, 2), (1, 3), (1, 4), (2, 4), (2, 5), (5, 6), (5, 7), (6, 7), (6, 8), (7, 8)]
G.add_edges_from(edgeList)
Next, we define the likelihood score as a function that takes in nodes x and y. It consists of several loops aggregating information from the common neighbours of x and y. The function takes in two probability models, prior and likelihood, which we'll define next.
# score based on mutual information
def s(x, y, prior, likelihood):
    # common neighbours of x and y
    CN = nx.common_neighbors(G, x, y)
    # self-information of the prior p(L_xy)
    priorXY = -np.log2(prior(x, y, G))
    # sum over common neighbours
    cnMI = 0
    for z in CN:
        # degree of z
        kz = G.degree(z)
        coeffZ = 1 / (kz * (kz - 1))
        # sum over pairs of neighbours of z
        zMI = 0
        for m, n in itertools.combinations(G.neighbors(z), 2):
            priorInfo = -np.log2(prior(m, n, G))
            likelihoodInfo = -np.log2(likelihood(z, G))
            # combine mutual information
            zMI += 2 * (priorInfo - likelihoodInfo)
        # add average mutual information per neighbour
        cnMI += coeffZ * zMI
    return cnMI - priorXY
We now define prior as a function of the combinatorial probability.
def prior(m, n, G):
    kn = G.degree(n)
    km = G.degree(m)
    M = G.number_of_edges()
    return 1 - math.comb(M - kn, km) / math.comb(M, km)
Meanwhile, likelihood is defined as the clustering coefficient of common neighbour z of node x and y.
def likelihood(z, G):
    kz = G.degree(z)
    N_triangles = nx.triangles(G, z)
    N_triads = math.comb(kz, 2)
    return N_triangles / N_triads
We have now defined the whole framework. Following the example provided by Tan et al. (2014), we can calculate the score for nodes 2 and 3 by calling the s function. And voilà! We get -1.667, precisely as shown in the paper.
s(2, 3, prior, likelihood)  # -1.667
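To actually use this for link prediction, we would score every pair of nodes that is not yet an edge and rank the pairs by score. The sketch below bundles the pieces defined above into one self-contained script and adds a hypothetical ranking step (the names candidates and ranking are my own, not from the paper):

```python
import itertools
import math

import networkx as nx
import numpy as np

# the dummy graph from Tan et al. (2014)
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 4), (2, 5),
                  (5, 6), (5, 7), (6, 7), (6, 8), (7, 8)])

def prior(m, n, G):
    # 1 minus the probability that none of the M edges falls between m and n
    kn, km, M = G.degree(n), G.degree(m), G.number_of_edges()
    return 1 - math.comb(M - kn, km) / math.comb(M, km)

def likelihood(z, G):
    # clustering coefficient of z: observed triangles over possible triads
    return nx.triangles(G, z) / math.comb(G.degree(z), 2)

def s(x, y, prior, likelihood):
    # mutual-information score of a candidate edge (x, y)
    priorXY = -np.log2(prior(x, y, G))
    cnMI = 0
    for z in nx.common_neighbors(G, x, y):
        kz = G.degree(z)
        # mutual information aggregated over pairs of neighbours of z
        zMI = sum(
            2 * (-np.log2(prior(m, n, G)) + np.log2(likelihood(z, G)))
            for m, n in itertools.combinations(G.neighbors(z), 2)
        )
        cnMI += zMI / (kz * (kz - 1))
    return cnMI - priorXY

# rank all node pairs that are not yet edges, most likely first
candidates = [p for p in itertools.combinations(G.nodes, 2) if not G.has_edge(*p)]
ranking = sorted(candidates, key=lambda p: s(*p, prior, likelihood), reverse=True)
print(ranking[:3])
```

The top-ranked pairs are the framework's predictions for which edges are most likely to appear next.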
We have seen how to cast the link prediction problem in an information-theoretic framework, into which we can plug different probability models of our choice. In this article, we looked at one particular choice based on the clustering coefficient and a combinatorial probability. For more types of models, please refer to Tan et al. (2014) and Kumar and Sharma (2020).
- Kumar, P. & Sharma, D. A potential energy and mutual information based link prediction approach for bipartite networks. Scientific Reports 10, 20659 (2020).
- Tan, F., Xia, Y. & Zhu, B. Link Prediction in Complex Networks: A Mutual Information Perspective. PLOS ONE 9, e107056 (2014).