In this part, we analyze the general population and customer data sets and use unsupervised learning to perform customer segmentation, identifying the parts of the population that best describe the company's core customer base.
Principal Component Analysis (PCA)
Because the data set is large, I used principal component analysis (PCA) to reduce its dimensionality. PCA has two further benefits:
- Reducing the number of features improves the performance of the clustering algorithm.
- With fewer features, the noise in the data is also reduced.
As a rule of thumb, we want to preserve at least 80% of the variance, so that we reduce dimensions without losing critical information from the data set. The chart above shows that at around 200 components the cumulative explained variance is around 90%.
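To make the variance threshold concrete, here is a minimal sketch of how the cumulative explained variance could be computed with scikit-learn. The helper name `components_for_variance`, the scaling step, and the synthetic `X_demo` matrix are assumptions for illustration only; they are not the project's actual code or data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, target=0.80):
    """Return the smallest number of components whose cumulative
    explained variance reaches `target`, plus the cumulative curve.

    Hypothetical helper, not part of the original project.
    """
    # PCA is sensitive to feature scale, so standardize first (an assumption here).
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the curve reaches the target, converted to a count.
    n_components = int(np.searchsorted(cumulative, target)) + 1
    return min(n_components, len(cumulative)), cumulative


# Synthetic stand-in data: 30 features driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))

n_keep, cumulative = components_for_variance(X_demo, target=0.80)
print(f"{n_keep} components reach 80% of the variance")
```

Plotting `cumulative` against the component count gives the kind of chart referred to above. On the real data the same call with `target=0.90` would return the component count for the 90% level.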
K-Means Clustering
With the dimensionality reduced, we can move on to clustering. First we need to determine the number of clusters. We run K-Means with different numbers of clusters and compute the within-cluster sum of squares (WCSS) for each solution. Based on the WCSS values and an approach known as the elbow method, we then decide how many clusters to keep.
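The elbow procedure can be sketched roughly as follows. In scikit-learn, the WCSS of a fitted `KMeans` model is exposed as `inertia_`. The helper `wcss_curve`, the range of k values, and the three synthetic blobs are assumptions chosen for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans


def wcss_curve(X, k_values, random_state=42):
    """Fit K-Means once per k and collect the within-cluster sum of squares.

    Hypothetical helper, not part of the original project.
    """
    wcss = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS
    return wcss


# Synthetic stand-in for the PCA-reduced data: three well-separated blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

k_values = list(range(1, 7))
wcss = wcss_curve(X_demo, k_values)
for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against `k_values` produces the elbow chart: the WCSS drops sharply up to the natural number of clusters and flattens afterwards. On real data the bend is usually less clear-cut than on these blobs, which is why a range of candidates is often considered.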
From this graph, the number of clusters could be anywhere between 3 and 6. I ran K-Means for each value from 3 to 6 and chose 3 clusters.
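Once the number of clusters is fixed, one possible way to connect the clusters back to the customer base is to fit K-Means on the general population, assign the customers to the same clusters, and compare the share of each group per cluster. The sketch below assumes two already-reduced arrays, here called `population_pca` and `customers_pca`; both names and the synthetic data are hypothetical, and the comparison by share ratio is an illustrative choice rather than the method confirmed by this write-up.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_shares(labels, n_clusters):
    """Fraction of samples falling into each cluster (hypothetical helper)."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()


# Synthetic stand-ins: the population is spread evenly over three blobs,
# while most customers come from the second blob.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
population_pca = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])
customers_pca = np.vstack([
    centers[1] + rng.normal(size=(150, 2)),
    centers[0] + rng.normal(size=(25, 2)),
    centers[2] + rng.normal(size=(25, 2)),
])

n_clusters = 3
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(population_pca)

population_share = cluster_shares(model.labels_, n_clusters)
customer_share = cluster_shares(model.predict(customers_pca), n_clusters)

# A ratio above 1 means customers are over-represented in that cluster.
ratio = customer_share / population_share
overrepresented = int(np.argmax(ratio))
print("population share:", np.round(population_share, 2))
print("customer share:  ", np.round(customer_share, 2))
print("most over-represented cluster:", overrepresented)
```

Clusters with a ratio well above 1 describe the parts of the population that resemble the core customer base, while clusters with a ratio well below 1 describe people who are unlikely to be customers.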