
Machine Learning (Supervised Learning)

In my previous blog — Shades of Machine Learning — we discussed the two main types of machine learning algorithms. Just to brush up: we have Supervised Learning (where the target is known/the data is labeled, so the model works under this supervision) and Unsupervised Learning (where the target is not known/the data is unlabeled, so the algorithm works without any supervision).
In this blog, we’re just going to talk about Classification. We will address some basic but important questions related to classification: what does classification really mean? What kind of data can we classify, and what kind can’t we? And what are some of the Classification Algorithms?
Before starting with the classification, let’s understand the different parts of a dataset and its relation with the algorithms in general.
I created a “hypothetical” dataset in the image above just to explain the theory (Eating pizza or coke or greens is totally your choice, please don’t hold me responsible for being fit/unfit! :P)
- Dataset — Any data arranged in the form of rows and columns works for ML. The columns are divided into two types — Variables (one or more columns) and Target (always one column). The rows are our data points.
- Target / Label — is the column that we want to predict. It is our resultant column, the one whose value we want to know for future data. In this dataset, it’s the column “Fit/Unfit” marked in blue. Our entire supervised learning depends on this one column, because this is what we want to know.
- Variables / Features — are the columns other than the Target column. These columns help the ML model predict the target for future data points. In this dataset, the variables are “Eats Pizza”, “Drinks Coke”, “Eats Greens” and “Workout”.
- You might be wondering what kind of column “Person” is. Well, when we feed data into an ML algorithm, there are some columns that we do not use, because we don’t want our model to “overfit”, i.e., learn each and every scenario; instead, we want the algorithm to understand the general pattern and build a model that predicts from it. We will discuss overfitting more in future blogs.
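Putting the pieces together, here is a minimal sketch in pandas of how such a dataset might be split into features and a target. All the row values below are invented for illustration, loosely mirroring the hypothetical dataset above:

```python
import pandas as pd

# Hypothetical data, invented for illustration (not from the actual image)
df = pd.DataFrame({
    "Person":      ["A", "B", "C", "D"],
    "Eats Pizza":  ["Yes", "No", "Yes", "No"],
    "Drinks Coke": ["Yes", "No", "No", "No"],
    "Eats Greens": ["No", "Yes", "No", "Yes"],
    "Workout":     ["No", "Yes", "Yes", "Yes"],
    "Fit/Unfit":   ["Unfit", "Fit", "Fit", "Fit"],
})

# Drop the identifier column ("Person") so the model learns general
# patterns rather than memorizing individual people
X = df.drop(columns=["Person", "Fit/Unfit"])  # variables / features
y = df["Fit/Unfit"]                           # target / label
```

Dropping “Person” before training is exactly the point made above: identifier columns carry no general pattern, only one-off facts about each row.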
So now that you’re familiar with how the datasets and algorithms relate, let’s come back to classification. As the name suggests, Classification means classifying the data on some grounds. It is a type of Supervised learning. In classification, the target column should be a Categorical column. If the target has only two categories like the one in the dataset above (Fit/Unfit), it’s called a Binary Classification Problem. When there are more than 2 categories, it’s a Multiclass Classification Problem. The “target” column is also called a “Class” in the Classification problem.
To classify something, we need finite categories. So we need a dataset with a Class label like [0, 1], [Pass, Fail], [Fit, Unfit], [Good, Better, Best], etc. If we have such values in the target column, we can use the Classification method to solve the problem. But if we have continuous numeric values in the target column, like [100, 23, 44, 46, 44.7, 24.8, …], we cannot classify such a dataset directly. In this case, we either convert these values into class values, for example {values > 50 are considered 1, and values < 50 are 0}, or we use other methods like regression, which is out of the scope of this blog.
To understand this better, let’s take the example of an employee salary dataset (below):
The above image consists of some data points, with Salary as our target variable. Since Salary is a continuous numeric column, we cannot treat this as a classification problem directly. But if we really want to treat it as a classification problem for some reason, we can bin the target column into two categories, for example Salary > 70,000 as High and Salary < 70,000 as Low. After doing this, the dataset will look something like this:
So now our data is ready to be treated as a Classification Problem. Of course, there are other data-processing steps we need to perform depending on the algorithm used, but at least we now have a categorical target to classify on.
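As a quick sketch of that binning step, assuming the salaries live in a pandas DataFrame (the numbers are invented for illustration; I count exactly 70,000 as High, a choice the rule above leaves unspecified):

```python
import pandas as pd

# Hypothetical salaries, invented for illustration
df = pd.DataFrame({"Salary": [85000, 45000, 70000, 120000, 30000]})

# Bin the continuous target into two classes using the 70,000 cutoff
# (ties at exactly 70,000 are counted as High here)
df["Salary Class"] = ["High" if s >= 70000 else "Low" for s in df["Salary"]]
```

With the new “Salary Class” column as the target, this is now a binary classification problem.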
Now that we know how the datasets can be related to ML models, and what datasets can be used as a Classification Problem, we should also know how to solve them, right?
There are a lot of algorithms in the world being used to solve classification problems, and newer algorithms are introduced every day. Some of these are Decision Trees, Support Vector Machines (SVM), Random Forest, Gradient Boosted Trees (GBT), K-Nearest Neighbors (KNN), etc. The Decision Tree is the most basic algorithm for understanding classification, and for understanding how trees work in machine learning. But I like to keep my blogs short, and Decision Trees deserve an entire blog of their own, so I will stop here and try to bring out my next blog on Decision Trees soon! ❤
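To give a tiny taste of what solving a classification problem looks like in code, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier (one common implementation; the toy data is invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification data, invented for illustration:
# each row is [eats_pizza, workout] encoded as 0/1
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = ["Unfit", "Fit", "Fit", "Unfit"]

# Fit a decision tree on the labeled data (supervised learning)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the class for a new, unseen data point
prediction = clf.predict([[0, 1]])
```

The same fit/predict pattern applies to the other algorithms listed above; only the model class changes.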
Meanwhile, you can check my previous blogs for the data preprocessing and ML project flow in the following order:
- Data Science for Non-Data Scientists
- Bridging the Gap between Business & Data Science
- Data Science — Where do I start?
- What’s inside the data!
- Understand the Patterns in the Data
- Feature Engineering — What to Keep and What to remove?
- Different Shades of Machine Learning
I hope this blog helps someone understand the things they were not able to get earlier (and keep the questions coming)! 🙂