K-means
The algorithm aims to partition the data points into k sets (the clusters) by minimizing the within-cluster variance, i.e. the sum of squared distances between each point and the centroid (center) of its cluster.
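To make the objective concrete, here is a minimal sketch, separate from the scripts below, that recomputes this within-cluster sum of squares by hand and compares it with scikit-learn's inertia_ attribute (the random_state values are arbitrary):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Sum of squared distances of each point to the centroid of its cluster
wcss = sum(np.sum((X[kmeans.labels_ == k] - c) ** 2)
           for k, c in enumerate(kmeans.cluster_centers_))
print(wcss, kmeans.inertia_)  # the two values should agree up to floating point error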
Oftentimes we have additional information about our data points: perhaps each sample belongs to a certain group, e.g. male/female, long-term/new user, country, etc. We can therefore assign a weight to each sample based on the group it belongs to. For this purpose, we use Weighted K-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# K-means
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

# Weighted K-means
weights = list(map(lambda x: x*10, y_true))
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X, sample_weight=weights)
y_kmeans_w = kmeans.predict(X)  # sample_weight only affects fitting, not prediction
centers_w = kmeans.cluster_centers_

# Plotting
fig, ax = plt.subplots(1, 3, figsize=(20,7))
colors = ['#d35400', '#34495e', '#2980b9']
versions = ['Original Data', 'K-means', 'Weighted K-means']
y = [y_true, y_kmeans, y_kmeans_w]
centers_ = [centers, centers_w]

for i in range(3):
    ax[i].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y[i])), alpha=0.5)
    if i > 0:  # the original-data panel has no fitted centroids to draw
        ax[i].scatter(centers_[i-1][:, 0], centers_[i-1][:, 1], marker='+', c='black', s=200)
    ax[i].set_title(versions[i])
    ax[i].set_xticks([])
    ax[i].set_yticks([])
plt.show()
For this section, we randomly generated a dataset of 3 clusters with 3000 data points in total. The weights are assigned as w_i = true_cluster_index * 10. Evidently, the clusters and their corresponding centroids are strongly affected by the weight vector. One should be cautious about when and how to use Weighted K-means, as it can have a big impact on the final cluster centers.
MiniBatch K-means
K-means is very powerful, but sometimes it is impossible to fit your whole dataset at once because of its size. In such cases, you can use MiniBatch K-means, which works on small mini-batches and moves the centroids only slightly at each iteration. This results in faster convergence and, most importantly, allows clustering of huge datasets that do not fit in memory.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import time

# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# K-means
kmeans = KMeans(n_clusters=n_clusters)
t0 = time.time()
kmeans.fit(X)
t_batch = time.time() - t0
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

# MiniBatch K-means
mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=45,
                      max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
y_kmeans_batch = mbk.predict(X)
centers_b = mbk.cluster_centers_

# Plotting
fig, ax = plt.subplots(1, 2, figsize=(15,7))
colors = ['#d35400', '#34495e', '#2980b9']
versions = ['K-means', 'MiniBatch K-means']
y = [y_kmeans, y_kmeans_batch]
t = [t_batch, t_mini_batch]
centers_ = [centers, centers_b]

for i in range(2):
    ax[i].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y[i])), alpha=0.5)
    ax[i].scatter(centers_[i][:, 0], centers_[i][:, 1], marker='+', c='black', s=200)
    ax[i].set_title(versions[i] + f' | train time: {np.round(t[i], 2)}s')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
plt.show()
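If the data truly cannot be held in memory at once, MiniBatchKMeans can also be trained incrementally with partial_fit, one chunk at a time. Below is a minimal sketch reusing X and n_clusters from the script above; in a real setting each chunk would be streamed from disk or a database, here we simply split the in-memory array to illustrate the idea:
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk_stream = MiniBatchKMeans(n_clusters=n_clusters, batch_size=45)
for chunk in np.array_split(X, 30):  # stand-in for chunks read from external storage
    mbk_stream.partial_fit(chunk)    # update the centroids using only this chunk
y_stream = mbk_stream.predict(X)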
Fuzzy c-means
K-means performs hard clustering (i.e. clusters do not overlap and each sample is assigned to exactly one cluster). What if samples can belong to multiple clusters? Then we need algorithms that perform soft clustering (i.e. each sample has a degree, or probability, of belonging to each cluster).
The process is the same as in K-means, but the centroid of cluster i is calculated as the mean of all points, weighted by their degree of membership in cluster i.
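To make this concrete, below is a minimal sketch of that weighted centroid update (not the skfuzzy implementation), assuming a membership matrix u of shape (n_clusters, n_samples) whose columns sum to 1 and the usual fuzzifier m = 2:
import numpy as np

def fuzzy_centroids(X, u, m=2):
    # X: (n_samples, n_features), u: (n_clusters, n_samples)
    um = u ** m
    # Each centroid is the mean of all points, weighted by the fuzzified membership degrees
    return (um @ X) / um.sum(axis=1, keepdims=True)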
import skfuzzy as fuzz
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# Fuzzy c-means
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(X.T, n_clusters, 2, error=0.005, maxiter=1000)

# Plotting
colors = ['#d35400', '#34495e', '#2980b9']
fig, ax = plt.subplots(1, 2, figsize=(15,7))

ax[0].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y_true)), alpha=0.5)
ax[0].set_title('Original Data')
ax[0].set_xticks([])
ax[0].set_yticks([])

cluster_membership = np.argmax(u, axis=0)
for j in range(n_clusters):
    cX = X[cluster_membership == j]
    ax[1].scatter(cX[:, 0], cX[:, 1], s=50, c=colors[j], alpha=0.5)

for pt in cntr:
    ax[1].scatter(pt[0], pt[1], marker='+', c='black', s=200)

ax[1].set_title(f'Fuzzy c-means | FPC {np.round(fpc, 2)}')
ax[1].set_xticks([])
ax[1].set_yticks([])
plt.show()
FPC (fuzzy partition coefficient) is a performance metric that ranges from 0 to 1, with 1 being the best. In our case, we achieved an FPC of 0.68. Here, each point is assigned to the cluster with the highest membership via np.argmax(u, axis=0); however, one can investigate the membership probabilities further. For example, we could define a threshold and find which samples belong to multiple clusters, or to a single one, after thresholding.
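As a minimal sketch of that idea, reusing the membership matrix u from the script above (the 0.3 cut-off is an arbitrary choice):
import numpy as np

threshold = 0.3  # arbitrary membership cut-off
counts = (u > threshold).sum(axis=0)       # in how many clusters each sample exceeds the cut-off
multi_cluster = np.where(counts > 1)[0]    # samples shared by more than one cluster
single_cluster = np.where(counts == 1)[0]  # samples belonging to exactly one cluster
print(len(multi_cluster), len(single_cluster))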
K-medoids
Say hello to K-means' cousin, K-medoids. You have probably already guessed that there are not many differences between the two algorithms, and you are right. The main differences are the following:
- The centers (medoids) are now actual data points from our dataset, something that is not required in K-means.
> greater interpretability of the cluster centers than in K-means
- Instead of Euclidean distance, K-medoids minimizes the sum of pairwise dissimilarity measures (see the sketch after this list).
> more robust to outliers and noise
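As a minimal sketch of the second point (this is not the sklearn_extra implementation), the medoid of a cluster can be found as the member with the smallest summed dissimilarity to all other members:
import numpy as np
from sklearn.metrics import pairwise_distances

def choose_medoid(points, metric="manhattan"):
    # points: the samples currently assigned to one cluster, shape (n, n_features)
    D = pairwise_distances(points, metric=metric)
    return points[D.sum(axis=1).argmin()]  # the actual data point minimizing total dissimilarity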
In the following script, we investigate the results of K-medoids with 3 pairwise dissimilarity measures (Manhattan, Cosine and Euclidean) and compare them against K-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
import matplotlib.pyplot as plt
import numpy as np
import itertools

# Generate data
centers=[[1,1], [-1,-1], [1,-1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# K-medoids models
selected_models = [(KMedoids(metric="manhattan", n_clusters=n_clusters), "KMedoids (manhattan)"),
                   (KMedoids(metric="euclidean", n_clusters=n_clusters), "KMedoids (euclidean)"),
                   (KMedoids(metric="cosine", n_clusters=n_clusters), "KMedoids (cosine)"),
                   (KMeans(n_clusters=n_clusters), 'KMeans')]

colors = ['#d35400', '#34495e', '#2980b9']

fig, ax = plt.subplots(2, 2, figsize=(15, 15))
perm = list(itertools.product([0, 1], repeat=2))  # the (row, col) index of each subplot

for s in range(4):
    i, j = perm[s]
    model, name = selected_models[s]
    model.fit(X)
    y_hat = model.predict(X)
    centers_hat = model.cluster_centers_
    inertia = model.inertia_
    ax[i, j].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y_hat)), alpha=0.5)
    ax[i, j].scatter(centers_hat[:, 0], centers_hat[:, 1], marker='+', c='black', s=200)
    ax[i, j].scatter(np.array(centers)[:, 0], np.array(centers)[:, 1], marker='o', c='black', s=200, alpha=0.6)
    ax[i, j].set_title(name)
    ax[i, j].set_xticks([])
    ax[i, j].set_yticks([])
plt.show()
In each plot the fitted cluster centers are represented with a cross, while the true centers are shown with a circle. Given this information, comparing all 4 plots, we can see that cosine performs the worst. K-medoids with Euclidean distance gives results very similar to K-means, as expected. Lastly, Manhattan distance gives results very similar to those of Euclidean distance.