
K-means clustering is a widely used Machine Learning technique, a form of vector quantization that divides a set of points in a vector space into k clusters. It is an extremely useful unsupervised method for summarizing many points with a small collection of centroids, which can later be used for many downstream tasks in deep learning. However, vanilla k-means can be too computationally expensive, sometimes taking days to run on large datasets. That is where variants such as running k-means hierarchically or consuming the data in minibatches greatly increase efficiency while keeping comparable performance.
The vanilla k-means algorithm starts with a dataset in a vector space, which we will call X, consisting of r points (x₁, x₂, …, xᵣ). We would like to fit k centroids onto X such that the variance within each centroid’s cluster is minimized. In layman’s terms, each centroid’s set of points should form a tight cluster, distinct from every other centroid’s cluster.
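To make “variance is minimized” concrete, the quantity k-means minimizes is the within-cluster sum of squared distances, often called the inertia. Here is a tiny NumPy sketch of that computation; the function and variable names are purely illustrative:

```python
import numpy as np

def inertia(X, centroids, labels):
    """Within-cluster sum of squared distances.

    X:         (r, d) array of points
    centroids: (k, d) array of centroid positions
    labels:    (r,)   index of the centroid assigned to each point
    """
    # Distance from every point to the centroid of its own cluster
    diffs = X - centroids[labels]
    return np.sum(diffs ** 2)
```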
So how would this be done in practice? There are several ways, but k-means typically uses something called Lloyd’s algorithm. First, the centroids are initialized to random points in the vector space. Each point in X is then assigned to its nearest centroid. After that, every centroid’s position is updated to be the mean of all the points in its cluster. However, after this step, some points in X will now be closer to a different centroid, since the centroid positions have changed. So we repeat the previous steps, updating each centroid’s position to reflect the changing mean of the points in its cluster, until the centroid positions stop changing, or in other words until the algorithm has converged. This allows the centroids to gravitate towards the means of the inherent clusters present in the data. Here is a gif that illustrates the entire process:
As you can see, this naïve k-means clustering method works fairly well for small datasets. However, there are several ways to make it faster and more efficient.
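Before moving on to those faster variants, here is a minimal NumPy sketch of the Lloyd’s algorithm loop described above. The initialization (picking k random points from X), the tolerance, and the iteration cap are illustrative choices rather than anything prescribed by the algorithm:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Plain Lloyd's algorithm: assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen points from X
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: each centroid moves to the mean of its cluster
        # (a centroid with no points simply stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids have (almost) stopped moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return centroids, labels
```

In practice you would usually reach for a library implementation such as sklearn.cluster.KMeans, but the loop above is essentially all that is happening underneath.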
Hierarchical k-means clustering (aka hk-means clustering), not to be confused with hierarchical clustering (which is something completely different), is a more efficient sibling of vanilla k-means. It is a recursive method involving doing k-means on the data with k being relatively small, then doing k-means on every centroid cluster using the same k, for h hierarchies. The centroids for the last hierarchy are concatenated to be the entire dataset’s centroids. This method is beneficial when having a large k and when it is computationally expensive to update the k centroids. A visual representation of this method is shown below:
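Alongside that visual, here is a rough code sketch of the recursion, reusing the lloyd_kmeans function from earlier; the depth parameter h and the guard against splitting tiny clusters are my own illustrative choices:

```python
import numpy as np

def hierarchical_kmeans(X, k, h):
    """Recursive (hierarchical) k-means: split X into k clusters,
    then split each cluster again, for h levels in total.
    Returns the centroids from the final level, concatenated."""
    centroids, labels = lloyd_kmeans(X, k)
    if h <= 1:
        return centroids

    child_centroids = []
    for j in range(k):
        cluster_points = X[labels == j]
        # Only recurse if the cluster still has enough points to split
        if len(cluster_points) >= k:
            child_centroids.append(hierarchical_kmeans(cluster_points, k, h - 1))
        else:
            child_centroids.append(centroids[j:j + 1])
    return np.concatenate(child_centroids, axis=0)
```

Note that with the same k at every level, the final number of centroids grows to roughly k to the power h, so even a small per-level k yields a large effective set of centroids while each individual k-means run stays cheap.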
As the name suggests, minibatch k-means and minibatch hk-means use minibatches to make the algorithm applicable to extremely large datasets. Similar to how a neural network consumes data in batches, this offshoot of k-means iterates over small random batches and updates the centroids after each one. This makes it practical to run k-means on large datasets and still get results comparable to what regular k-means would yield. For a full coded example of minibatch hk-means, check out this python package:
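Separately from that package, here is a rough NumPy sketch of the core minibatch update. It follows the common running-mean scheme, where each centroid is nudged toward the batch points assigned to it with a step size that shrinks as the centroid’s count grows; the batch size and iteration count are arbitrary illustrative values:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=256, n_iters=100, seed=0):
    """Minibatch k-means: update centroids from one small random batch at a time."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many points each centroid has absorbed so far

    for _ in range(n_iters):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest centroid
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Move each assigned centroid a small step toward each batch point
        for x, j in zip(batch, labels):
            counts[j] += 1
            lr = 1.0 / counts[j]  # per-centroid learning rate shrinks over time
            centroids[j] = (1 - lr) * centroids[j] + lr * x

    return centroids
```

If you just need a ready-made implementation, scikit-learn ships this idea as sklearn.cluster.MiniBatchKMeans, which is usually the easier option in practice.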
K-means is a useful technique that can serve as a clustering method, a form of vector quantization, and much more. It also has multiple offshoots that can improve performance and efficiency on large real-world datasets. Two of these variants are hierarchical k-means and minibatch k-means, but there are countless other ways to improve it further, such as initializing the centroids more intelligently (as k-means++ does). With so many uses and so many improvements, it is not a surprise that k-means remains one of the most prominent clustering algorithms in Machine Learning.