Supervised and unsupervised machine learning are two important streams of machine learning (there are others also like reinforcement learning, semi-supervised learning etc., but here we need not to bother about those). In supervised learning we are given, what is called a ‘labelled’ data, which is generally in the form of input-output pairs and the task of machine learning is to learning the mapping between these input-out pairs. Linear regression, in which we the output linearly depends on the fitting parameters, is the most commonly known type of supervised learning. Other examples are logistic regression, support vector machine (svm), decision trees and neural network.

Clustering is the most basic and common example of unsupervised learning, in which unlabelled data (only input given) is grouped into a user defined numbers of clusters on the basis of some proximity criteria. Clustering plays an important role in many practical applications such as dimensionality reduction and outlier and anomaly detections.

The idea and algorithm behind K-means clustering is quite simple : we partition the data into ‘K’ number of clusters, where ‘K’ is supplied by the user.

The steps involved in K-means clustering are as follows:

- Chose k number of random points as centers (centroids) of k-clusters.
- Associate each point to one of the clusters on the basis of its proximity to the centers of the clusters.
- Recompute centroids on the basis of the data point belonging to different clusters.
- Keep repeating the procedure till centroids do not stop changing their values.

The full code is given here and here I just want to make some comments about the code. The code does not use any library and builds the logic from the scratch, however, it does need you understand the use of the following three python libraries.

- Numpy — Numpy is used for generating random numbers, computing mathematical functions and doing some simple array mathematics like finding the maximum/minimum points in an array.
- Pandas — Pandas data frames are used to represent the data which makes it very easy to move around and manipulate data. You can appreciate the use of pandas data frame once you start using those. Here I add one extra columns in the data frame for the cluster membership of the data points.
- Matplotlib — This library is used for showing the scatter plot of the data points.