In the above, we implemented a baseline model with a set of pre-determined inputs. The purpose of this section is to go beyond that baseline and see how different setups change the clustering. We will run three different experiments:
- Feature selection
- Normalization
- Algorithm
Feature selection
Like I said before, you can choose any number of features for creating clusters, and in the baseline model we used just two. But what if we select three features instead, and how does that impact the clusters?
Let’s re-run the model with new features.
# feature selection
df = df[["petal_length", "petal_width", "sepal_length"]]

# normalizing inputs
X = preprocessing.scale(df)

# instantiate model
model = KMeans(n_clusters=3)

# fit and predict
y_model = model.fit_predict(X)

# counts per cluster
print("Value Counts")
print(pd.value_counts(y_model))

# visualize clusters
plt.scatter(X[:,0], X[:,1], c=model.labels_, cmap='viridis')
If you compare this new model with the baseline, you will see that we still have 3 clusters and the cluster in the bottom corner remains the same. The other clusters, however, have quite a few overlapping data points. This is because we are using a two-dimensional plot to show clusters built on three features, so points that are separated along the third feature appear to overlap. If we created a 3D scatterplot, the clusters probably wouldn't overlap.
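If you want to check that, below is a minimal sketch of such a 3D scatterplot with matplotlib. It assumes X and model are still the scaled three-feature inputs and the fitted K-means model from the snippet above.

# a minimal sketch of a 3D view of the three-feature clusters
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=model.labels_, cmap='viridis')
ax.set_xlabel("petal_length (scaled)")
ax.set_ylabel("petal_width (scaled)")
ax.set_zlabel("sepal_length (scaled)")
plt.show()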
Normalization
Normalization deserves a long discussion, but to make a long story really short, its purpose is to put all features on a comparable scale, say roughly -2 to +2. The benefit is that it condenses highly scattered/dispersed data, which makes it easier to find clusters.
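If you want to see what the scaling actually does, here is a quick, standalone illustration (not part of the iris pipeline): preprocessing.scale standardizes each column to roughly zero mean and unit standard deviation, which is why most values end up in a small range around zero.

# a quick illustration of preprocessing.scale on made-up numbers
import numpy as np
from sklearn import preprocessing

raw = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaled = preprocessing.scale(raw)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]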
Let’s re-run with the new setup.
# feature selection
df = df[["petal_length", "petal_width"]]

# inputs (NOT normalized)
X = df.values

# instantiate model
model = KMeans(n_clusters=3)

# fit and predict
y_model = model.fit_predict(X)

# visualize clusters
plt.scatter(X[:,0], X[:,1], c=model.labels_, cmap='viridis')

# counts per cluster
print("Value Counts")
print(pd.value_counts(y_model))
Homework: Can you now compare the 3-feature outputs with the 2-feature outputs in the baseline? You can inspect the plots visually and also compare the value counts.
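One way to line the counts up side by side, as a rough sketch: the variable names y_baseline and y_three_features are hypothetical and stand for the labels returned by the 2-feature baseline run and the 3-feature run above.

# a sketch of comparing cluster sizes from two runs (hypothetical label variables)
import pandas as pd

comparison = pd.DataFrame({
    "baseline_2_features": pd.value_counts(y_baseline),
    "three_features": pd.value_counts(y_three_features),
})
print(comparison)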
Algorithm
We talked about quite a few algorithms that can be used for clustering, and each has advantages and disadvantages. No algorithm is better or worse in an *absolute* sense; it depends on the underlying data structure and the feature space. In the real world you will need to experiment with several algorithms and see which one does a good job.
Let's set up the model with a different algorithm, this time using hierarchical (agglomerative) clustering, keeping everything else the same, and compare the results with the baseline model.
# import model
from sklearn.cluster import AgglomerativeClustering

# feature selection
df = df[["petal_length", "petal_width"]]

# normalizing inputs
X = preprocessing.scale(df)

# instantiate model
# note: in newer versions of scikit-learn the 'affinity' argument is named 'metric'
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')

# fit and predict
y_model = model.fit_predict(X)

# plot clusters
plt.scatter(X[:,0], X[:,1], c=model.labels_, cmap='viridis')

# counts per cluster
print("Value Counts")
print(pd.value_counts(y_model))
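One caveat when comparing these counts with the baseline: cluster labels are arbitrary integers, so cluster 0 from K-means does not necessarily correspond to cluster 0 from the hierarchical model. A cross-tabulation makes the correspondence explicit. This is a minimal sketch that assumes y_kmeans holds the labels from the baseline K-means run.

# a sketch of matching hierarchical clusters against the baseline K-means clusters
import pandas as pd

print(pd.crosstab(y_kmeans, y_model, rownames=["kmeans"], colnames=["hierarchical"]))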
Determining the number of clusters
This is a bonus part. We somewhat arbitrarily specified 3 clusters in the baseline model and the subsequent experiments. It was easy to determine from visual inspection because it's a tiny dataset. But there is also a data-driven approach for making that determination, called the Elbow method. The code snippet below may look a bit more complex than it needs to be, but don't feel obliged to work through it line by line if you don't want to.
# determine the number of clusters using the "elbow method"
k_range = range(1, 10)
sse = []  # sum of squared errors, which we want to minimize

for k in k_range:
    m = KMeans(n_clusters=k, random_state=0)
    m.fit(X)
    sse.append(m.inertia_)

# plot SSE against the number of clusters
plt.xlabel("K")
plt.ylabel("Sum of squared errors")
plt.plot(k_range, sse, marker='x')
What the Elbow method does is help us heuristically decide the number of clusters: we look for a cutoff point (here k = 3) beyond which adding more clusters does not significantly reduce the sum of squared errors. That's the point where the curve bends like a human elbow.
I've covered several different themes in this article, so it's time to summarize them:
- clustering is simple as a concept, but for a large and/or multi-dimensional dataset you need a machine to implement it
- use cases are wide-ranging, from descriptive statistics, anomaly detection and recommendation system design to biology, spatial statistics and urban planning
- several algorithms are on the market, but the popular ones are K-means, hierarchical and DBSCAN clustering
- with cleaned data, running and interpreting cluster analysis with sklearn is an easy task
- the number of features, normalization and the choice of algorithm all make a difference in how clustering turns out
I hope this was a useful article. If you have comments feel free to write them down below. You can follow me on Medium, Twitter or LinkedIn.