This article introduces the concepts of unsupervised learning, with a focus on implementing some clustering techniques using Scikit-Learn.
In the real world, data is often unlabeled, and that is where unsupervised learning comes in. You might apply an unsupervised learning technique to let a model extract structure from unlabeled data on its own. For example, if you want to identify photos of a specific individual, you might feed a model lots of different photographs, millions of them, until it starts identifying similar features. Unsupervised learning techniques are also used for latent factor analysis, anomaly detection, quantization (especially color quantization), or as pre-training for supervised learning problems such as classification and regression. Among all unsupervised learning techniques, two are particularly popular and widely used: autoencoders and clustering.
Supervised vs. Unsupervised Learning
Machine learning algorithms fall into two broad categories: supervised and unsupervised learning.
- Supervised learning algorithms seek to learn the function F that links the input features to the output labels. You can think of supervised learning as a complex reverse-engineering problem in which the model tries to figure out exactly what the F linking input to output is. Linear regression is an example of supervised learning.
- Unsupervised learning has no y variables and no correctly labeled corpus: you only have the raw features in your input data, and everything you do uses those raw features alone. There are no labels or predictions associated with them.
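To make the contrast concrete, here is a minimal supervised-learning sketch using linear regression: the model is handed (X, y) pairs and recovers the function F that links them. The data is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic labeled data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(100, 1))                      # input features
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 0.5, size=100)   # output labels

# Supervised learning: fit learns F from the (X, y) pairs.
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)  # recovered slope and intercept, close to 3.0 and 2.0
```

Because the labels y are available during training, the model can check its guesses against them; in the unsupervised setting below, that feedback simply does not exist.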
Clustering is an example of an unsupervised learning technique: we don't work with a labeled corpus to train our model. Clustering works directly with the features in your data and tries to find patterns and logical groupings in the underlying dataset. It is applicable in a wide range of use cases, such as finding relevant documents in a corpus or performing color quantization.
When you're working with clustering, or any other unsupervised learning technique, you only have the input X data; you do not have output predictions or labels. What you are trying to do is model or learn the underlying structure of the data, in order to understand it better and find patterns. Unsupervised learning algorithms discover patterns and structure in the data by themselves; they just have to be set up correctly.
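The point above can be sketched with k-means, one of the simplest clustering algorithms in Scikit-Learn: the model only ever sees the features X, and the grouping emerges from the data itself. The blobs here are synthetic, generated just for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 unlabeled 2-D points in 3 blobs; the true labels are discarded.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Unsupervised learning: only X is passed in, never any y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster assignment for each point

print(labels[:10])                    # e.g. array of cluster ids 0, 1, 2
print(kmeans.cluster_centers_.shape)  # (3, 2): one learned center per cluster
```

Note that the cluster ids are arbitrary: the algorithm invents groups, it does not recover any ground-truth labels, because it never saw any.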
Choosing Clustering Algorithms
It's very easy to implement a particular kind of clustering on your data. What is more important, and what requires understanding, is choosing the right clustering algorithm for your use case and knowing the parameters you'll use to tweak and tune that algorithm.
You'll choose a clustering algorithm based on two important factors: the number of clusters you want to form and the size of the original dataset. You could have a very small dataset (anything under 1,000 rows would be considered small), a medium-sized dataset, or a very large dataset with millions of records. Say you have a very large dataset with millions of records and you want to partition it into many clusters. The clustering algorithms you might choose are BIRCH or agglomerative clustering. In the same way, the image below shows the appropriate algorithms to use according to the criteria mentioned.
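As a hedged sketch of the large-dataset case, here is BIRCH on a larger synthetic dataset. BIRCH first compresses the data into a compact tree of subcluster summaries (the CF tree), which is what lets it handle datasets that would be expensive to cluster directly; the parameters below are illustrative, not tuned.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# A larger synthetic dataset: 20,000 points drawn from 8 blobs.
X, _ = make_blobs(n_samples=20_000, centers=8, random_state=0)

# BIRCH summarizes the data into a CF tree, then groups the
# resulting subclusters into the requested number of clusters.
birch = Birch(n_clusters=8, threshold=0.5)
labels = birch.fit_predict(X)

print(len(set(labels)))  # number of distinct clusters found
```

The `threshold` parameter controls how aggressively points are merged into subcluster summaries; a larger threshold means a smaller tree and faster fitting, at the cost of coarser summaries.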