## K-means

The algorithm aims to partition the data points into k sets (i.e. the *clusters*) by minimizing the within-cluster variance. In practice, this amounts to minimizing each point's squared distance from its cluster's centroid (the center of the cluster).
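Formally, in the standard formulation, K-means searches for the partition $S = \{S_1, \dots, S_k\}$ that minimizes the within-cluster sum of squares:

$$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the centroid (mean) of the points assigned to cluster $S_i$.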

Oftentimes we have additional information about our data points. Perhaps each sample belongs to a certain group, e.g. male/female, long-term/new user, country, etc. We could therefore assign a weight to each sample based on the group it belongs to. For this purpose, we use **Weighted K-means**.
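In one common formulation of the weighted variant, each sample $x_j$ carries a weight $w_j$, and each centroid becomes the weighted mean of its assigned points:

$$\mu_i = \frac{\sum_{x_j \in S_i} w_j \, x_j}{\sum_{x_j \in S_i} w_j}.$$

Samples with larger weights therefore pull their cluster's centroid more strongly toward them.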

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# K-means
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

# Weighted K-means: weight each sample according to its group label
weights = list(map(lambda x: x * 10, y_true))
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X, sample_weight=weights)
y_kmeans_w = kmeans.predict(X)  # weights affect fitting, not prediction
centers_w = kmeans.cluster_centers_

# Plotting
fig, ax = plt.subplots(1, 3, figsize=(20, 7))
colors = ['#d35400', '#34495e', '#2980b9']
versions = ['Original Data', 'K-means', 'Weighted K-means']
y = [y_true, y_kmeans, y_kmeans_w]
centers_ = [centers, centers_w]
for i in range(3):
    ax[i].scatter(X[:, 0], X[:, 1], s=50,
                  c=list(map(lambda x: colors[x], y[i])), alpha=0.5)
    if i > 0:  # mark the fitted centroids on the two K-means panels
        ax[i].scatter(centers_[i - 1][:, 0], centers_[i - 1][:, 1],
                      marker='+', c='black', s=200)
    ax[i].set_title(versions[i])
    ax[i].set_xticks([])
    ax[i].set_yticks([])
plt.show()
```
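To see the effect of `sample_weight` in isolation, here is a minimal sketch (not part of the example above) that fits a single cluster to two points: the centroid lands at the weighted mean rather than the midpoint.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two points on a line; weighting the second point 3x pulls the
# single centroid toward it: (0*1 + 10*3) / (1 + 3) = 7.5.
X = np.array([[0.0], [10.0]])
km = KMeans(n_clusters=1, n_init=10, random_state=0)
km.fit(X, sample_weight=[1, 3])
print(km.cluster_centers_)  # centroid at the weighted mean, 7.5
```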