Before introducing the algorithm, I will try to explain the intuition behind the main idea. The idea is similar to how the random forest algorithm ranks important features. However, random forest does that for supervised problems, where class labels are present. We implemented a similar idea for clustering, where class labels are not present.
For demonstration, let us consider the following toy dataset.
There are two features in the dataset, weight and height, and two classes: toddler and grown-up. Just by looking at the data we can see that the values for toddlers are quite different from those for grown-ups. The darker blue rows are related to toddlers and the light blue row is related to grown-ups. Since we have labels in this case, we can use any machine learning algorithm to predict the class labels from weight and height, and then calculate the prediction error by comparing the predictions to the given class labels (ideally, we should split the data into training and test sets, but for simplicity we will not go into that here). Now, if we randomly shuffle one of the feature columns (please refer to the following picture) and fit the same machine learning algorithm again, what can we expect to happen to the prediction error?
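Since the exact values in the picture are not reproduced here, the following sketch uses made-up weight and height numbers purely to illustrate what shuffling a single feature column looks like (the column names and values are assumptions, not the author's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: weight (kg) and height (cm) for
# toddlers and grown-ups. The values are made up for illustration.
df = pd.DataFrame({
    "weight": [12, 11, 13, 70, 80, 75],
    "height": [85, 80, 90, 170, 180, 175],
    "label":  ["toddler"] * 3 + ["grown-up"] * 3,
})

# Shuffle a single feature column, leaving everything else intact.
shuffled = df.copy()
shuffled["weight"] = rng.permutation(shuffled["weight"].to_numpy())
```

After the shuffle, the `weight` column still contains the same values, but they are no longer paired with the right `height` and `label` rows, which is exactly the distortion described above.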
It is reasonable to expect that the prediction error will increase with this shuffled dataset. Even by looking at the shuffled dataset we can see that the "relationship" between the features and the class labels is distorted by the shuffling. I think of this approach as working with a combination lock. There is only one arrangement of the dials that opens the lock; this corresponds to the original dataset that we started with. We can shuffle one of the dials, or several of them, any way we want. In either case, the correct combination is destroyed, and we will not be able to open the lock.
Back to the toy dataset: we can repeat the random shuffling multiple times, calculate the prediction error each time, and then take the mean of those prediction errors. We can do the same for every feature. We then assume that the mean prediction error will be highest for the feature that is most important. This is a reasonable assumption, since shuffling a feature essentially turns it into noise. By turning the most important feature into noise, we lose a significant amount of information, which hurts the prediction algorithm, so the prediction error will be worse than with the original, unshuffled dataset. These mean prediction errors can therefore serve as an indication of how important each feature is to the algorithm.
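The supervised procedure above can be sketched as follows. This is not the author's exact code: it uses made-up toy data, scikit-learn's `LogisticRegression` as a placeholder model, and refits the model after each shuffle, as the text describes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: [weight, height]; labels 0 = toddler, 1 = grown-up.
X = np.array([[12, 85], [11, 80], [13, 90],
              [70, 170], [80, 180], [75, 175]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Baseline error on the original, unshuffled data.
model = LogisticRegression().fit(X, y)
baseline_error = 1.0 - model.score(X, y)

n_repeats = 20
mean_errors = []
for j in range(X.shape[1]):
    errors = []
    for _ in range(n_repeats):
        # Shuffle feature column j, refit, and record the error.
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        m = LogisticRegression().fit(X_perm, y)
        errors.append(1.0 - m.score(X_perm, y))
    mean_errors.append(float(np.mean(errors)))
```

`mean_errors[j]` is then compared across features: the larger the mean error after shuffling feature `j`, the more important that feature is taken to be. (Standard permutation importance re-predicts with the original model instead of refitting; the refit variant here follows the text.)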
For unsupervised problems we do not have class labels, but clustering algorithms optimize objective functions instead. For example, k-means minimizes the sum of squared distances from the cluster means, and model-based clustering with Gaussian mixture models uses the log-likelihood or the Bayesian Information Criterion (BIC) as the objective function. So, whichever clustering algorithm we decide to use, we can identify the features for which the mean value of the objective function after shuffling differs considerably from the value of the objective function on the original dataset. This lets us rank the features by their importance.
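A minimal sketch of the unsupervised variant, assuming k-means with its inertia (sum of squared distances to cluster means) as the objective; the synthetic data and parameter choices below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical two-feature data with two clear clusters
# (roughly "toddlers" and "grown-ups" in [weight, height]).
X = np.vstack([
    rng.normal([12, 85], [1, 3], size=(20, 2)),
    rng.normal([75, 175], [5, 5], size=(20, 2)),
])

# Objective value on the original, unshuffled data.
base_inertia = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).inertia_

n_repeats = 10
importance = []
for j in range(X.shape[1]):
    inertias = []
    for _ in range(n_repeats):
        # Shuffle feature column j and recompute the clustering objective.
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_perm)
        inertias.append(km.inertia_)
    # How far the mean objective drifts from the unshuffled baseline.
    importance.append(abs(float(np.mean(inertias)) - base_inertia))
```

Sorting the features by `importance` in descending order then gives the ranking described in the text; with a different clustering algorithm, `inertia_` would simply be replaced by that algorithm's objective (e.g. log-likelihood or BIC for a Gaussian mixture).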