The theory was first published in 2007 in The Annals of Statistics by Gábor J. Székely and others¹. It is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal, dimensions². It can also be used for K-sample testing with some design tweaks.
What’s great about it? It’s a nonparametric test, meaning it makes no assumptions about the form of the relationship between the variables or about their distributions. Unlike traditional correlation measures such as Pearson’s r, it can detect non-linear relationships, and it still works for high-dimensional data.
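As a quick illustration (a minimal sketch, assuming NumPy, SciPy, and the third-party dcor package are installed; the example data is mine), Pearson’s r misses a perfect quadratic relationship while distance correlation flags it:

```python
import numpy as np
from scipy.stats import pearsonr
import dcor  # third-party package providing distance correlation

# A perfectly dependent but non-linear (quadratic) relationship
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x ** 2

print(pearsonr(x, y)[0])                # close to 0: Pearson misses the dependence
print(dcor.distance_correlation(x, y))  # clearly above 0: dependence is detected
```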
Math Definitions
Here is the formula to calculate the distance correlation.
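In symbols (a standard way of writing it, consistent with the description below, where A and B denote the double-centered distance matrices of the n observations of X and Y):

$$
\mathrm{dCov}^2(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{jk} B_{jk},
\qquad
\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}
$$

where dVar(X) = dCov(X, X), and the ratio is taken to be 0 when the denominator is 0.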
The numerator is the distance covariance, computed from the double-centered distance matrices of X and Y, and the denominator is the product of their distance standard deviations. In my opinion, there are essentially three components to understanding the method (a code sketch follows the list):
- Distance Matrix: a pairwise matrix that stores the Euclidean distance between any two points of a given variable.
- Double-Centered Matrix: from each element of the distance matrix, subtract its row mean and its column mean, then add back the grand mean of the matrix. In essence, this step moves the origin to the center of the cloud of points (it may sound confusing at first, but a simple example is coming).
- Frobenius Inner Product: the element-wise inner product of the two double-centered matrices from the previous step, which is a scalar. The result indicates how similar the two matrices are; if the variables are independent, it will be (close to) 0. You can think of this operation as the matrix analogue of the vector dot product, measuring orthogonality in higher dimensions.
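Putting the three components together, here is a minimal from-scratch sketch in NumPy (function and variable names are my own, not from the paper):

```python
import numpy as np

def double_centered_distance_matrix(data) -> np.ndarray:
    """Pairwise Euclidean distance matrix of a sample, double-centered."""
    arr = np.asarray(data, dtype=float)
    arr = arr.reshape(arr.shape[0], -1)          # works for 1-D or multi-dimensional samples
    # Component 1: pairwise Euclidean distances
    diff = arr[:, None, :] - arr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Component 2: subtract row and column means, add back the grand mean
    return (dist
            - dist.mean(axis=0, keepdims=True)
            - dist.mean(axis=1, keepdims=True)
            + dist.mean())

def distance_correlation(x, y) -> float:
    a = double_centered_distance_matrix(x)
    b = double_centered_distance_matrix(y)
    # Component 3: Frobenius inner products, averaged over the n^2 entries
    dcov2 = (a * b).mean()      # squared distance covariance
    dvar2_x = (a * a).mean()    # squared distance variance of x
    dvar2_y = (b * b).mean()    # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))
```

Applied to the quadratic example above it returns a value clearly above 0, and for the four square-corner points in the example below it returns exactly 0.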
A Simple Example
Thanks to this discussion, I made some visualizations to illustrate the concept. Here we have four points (𝑋,𝑌)=[(0,0),(0,1),(1,0),(1,1)] that form a square.
The distance matrices for X and Y are:
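Worked out from the four points in the order listed above (each entry is the pairwise distance within the X coordinates or within the Y coordinates; the labels D_X and D_Y are mine):

$$
D_X = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix},
\qquad
D_Y = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}
$$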
After double-centering, the matrices will become:
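Since every row mean, column mean, and the grand mean of both matrices above equals 0.5, double-centering here amounts to subtracting 0.5 from every entry:

$$
A = \begin{pmatrix} -0.5 & -0.5 & 0.5 & 0.5 \\ -0.5 & -0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & -0.5 \end{pmatrix},
\qquad
B = \begin{pmatrix} -0.5 & 0.5 & -0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 & -0.5 \\ -0.5 & 0.5 & -0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 & -0.5 \end{pmatrix}
$$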
The new origin after double-centering, the center of the cloud of points, is the middle of the square.
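To finish the example: the Frobenius inner product of A and B is zero, because every +0.25 product is cancelled by a −0.25 product, so the distance covariance, and hence the distance correlation, is 0, matching the intuition that X and Y are independent here:

$$
\langle A, B \rangle_F = \sum_{j,k} A_{jk} B_{jk} = 0
\;\Rightarrow\;
\mathrm{dCov}(X, Y) = 0
\;\Rightarrow\;
\mathrm{dCor}(X, Y) = 0
$$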