Another use case for clustering is in semi-supervised learning, when we have plenty of unlabeled instances and very few labeled instances. Let's train a logistic regression model on a sample of 50 labeled instances from the digits dataset.
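If you want to follow along, the training and test sets used below can be prepared with something like this (a minimal setup sketch; the split parameters are assumptions, not values taken from this section):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the 8x8 digit images as flat 64-dimensional feature vectors.
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=42)

With the data in place, train on just the first 50 labeled instances: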
n_labeled = 50
log_reg = LogisticRegression()
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
Here is the performance of this model on the test set:
>>> log_reg.score(X_test, y_test)
0.8266666666666667
The accuracy is just 82.7%: it should come as no surprise that this is much lower than earlier, when we trained the model on the full training set. Let's see how we can do better. First, let's cluster the training set into 50 clusters, then for each cluster find the image closest to the centroid; we will call these images the representative images:
k = 50
kmeans = KMeans(n_clusters=k)
X_digits_dist = kmeans.fit_transform(X_train)
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]
Now let's plot these 50 representative images so we can label them manually.
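One way to display them is with a simple Matplotlib grid (a plotting sketch; the figure layout is an assumption):

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 2))
for index, X_representative_digit in enumerate(X_representative_digits):
    plt.subplot(k // 10, 10, index + 1)  # a 5 x 10 grid of digit images
    plt.imshow(X_representative_digit.reshape(8, 8), cmap="binary",
               interpolation="bilinear")
    plt.axis('off')
plt.show()

After inspecting each image, we write down its label by hand: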
y_representative_digits = np.array([4, 8, 0, 6, 8, 3, ..., 7, 6, 2, 3, 1, 1])
Now we have a dataset with just 50 labeled instances, but instead of being random instances, each of them is a representative image of its cluster. Let's see if the performance is any better:
>>> log_reg = LogisticRegression()
>>> log_reg.fit(X_representative_digits, y_representative_digits)
>>> log_reg.score(X_test, y_test)
0.9244444444444444
Wow! We jumped from 82.7% accuracy to 92.4%, even though we are still only training the model on 50 instances. Since it is often costly and painful to label instances, especially when it has to be done manually by experts, it is a good idea to label representative instances rather than just random instances.
What if we propagated the labels to all the other instances in the same cluster? This is called label propagation:
y_train_propagated = np.empty(len(X_train), dtype=np.int32)
for i in range(k):
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]
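As a side note, since kmeans.labels_ holds each training instance's cluster index, the same propagation can be written as a single NumPy fancy-indexing expression (an equivalent one-liner, not the loop above):

# Equivalent vectorized version: look up each instance's label via its cluster index.
y_train_propagated = y_representative_digits[kmeans.labels_]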
Now let's train the model on the fully propagated labels and look at its performance:
>>> log_reg = LogisticRegression()
>>> log_reg.fit(X_train, y_train_propagated)
>>> log_reg.score(X_test, y_test)
0.9288888888888889
We got a tiny accuracy boost. Better than nothing, but not astounding. The problem is that we propagated each representative instance's label to every instance in its cluster, including the instances located close to the cluster boundaries, which are more likely to be mislabeled. Let's see what happens if we only propagate the labels to the 20% of the instances that are closest to the centroids:
percentile_closest = 20

X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1

partially_propagated = (X_cluster_dist != -1)
X_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]
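As a quick sanity check, you can count how many training instances survive the cutoff (a sketch; the exact number depends on your split and clustering, but it should be roughly 20% of the training set):

# Each cluster keeps only its closest 20% of instances.
print(X_train_partially_propagated.shape[0], "instances kept out of", len(X_train))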
Let's train the model again on this partially propagated dataset and look at its performance:
>>> log_reg = LogisticRegression()
>>> log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)
>>> log_reg.score(X_test, y_test)
0.9422222222222222
Nice! With just 50 labeled instances (only 5 examples per class on average!), we got 94.2% accuracy, which is pretty close to the performance of logistic regression on the fully labeled digits dataset. This is because the propagated labels are actually pretty good: their accuracy is very close to 99%:
>>> np.mean(y_train_partially_propagated == y_train[partially_propagated])
0.9896907216494846