Classification is one of the most fundamental concepts in data science. It is a machine learning method in which a predictive model assigns a class label to an input. Classification algorithms learn from labeled training data how to assign new data to preset categories.
Proper data classification also allows one to apply appropriate controls based on the predetermined category of the data. Classifying data can save time and resources, because one is able to focus on what’s important rather than putting unnecessary controls in place.
There are four main types of classification in machine learning:
- Binary classification: two class labels; provides a yes or no answer — ex: identifying spam email
- Multi-class classification: more than two class labels — ex: classifying faces or plant species
- Multi-label classification: two or more class labels, where one or more labels may be predicted for each example — ex: photo classification
- Imbalanced classification: class labels that are unequally distributed — ex: fraud detection or medical diagnosis
Several algorithms used commonly in data classification include:
- K-Nearest Neighbor
- Logistic Regression
- Artificial Neural Networks/Deep Learning
- Support Vector Machines
I’ll be discussing one of the most fundamental and well-known machine learning algorithms used in classification: the K-nearest neighbors (KNN) algorithm.
K-nearest neighbors classifier
KNN classifies new data points based on their similarity to previously stored data points. The algorithm computes the distance between a query and every example in the data, selects the specified number of examples (K) closest to the query, and then votes for the most frequent label (for classification) or averages the labels (for regression).
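The steps above can be sketched in plain Python. The points and labels here are made up purely for illustration:

```python
from collections import Counter
import math

def knn_predict(query, data, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Compute the Euclidean distance from the query to every stored point
    distances = [math.dist(query, point) for point in data]
    # Indices of the k points closest to the query
    nearest = sorted(range(len(data)), key=lambda i: distances[i])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two made-up clusters of 2D points
data = [(1, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict((2, 1), data, labels, k=3))  # → a
```

For regression, the final vote would instead be replaced by an average of the neighbors’ values.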
This algorithm is used by many in the industry due to its low calculation time and its non-parametric nature (meaning it makes no assumptions about the data). A common application of KNN can be seen in Concept Searches. These are software packages used to help companies locate similar documents in an email inbox, for example. Another instance is with Recommender Systems where an algorithm can recommend products, media, or advertisements for an individual based on their previous purchases or engagements.
While the KNN algorithm can be relatively easy to use and train, the accuracy of the KNN classifier will depend on the quality of the data and the specific K value chosen.
When implementing a KNN classifier, data scientists must also decide how many neighbors to consider. The optimal K value (the number of neighbors considered) will affect the prediction model, and different datasets have different requirements. According to datacamp.com, with a small number of neighbors, noise has a higher influence on the result, while a large number of neighbors makes the classification computationally expensive. Research has also shown that a small number of neighbors produces the most flexible fit, with low bias but high variance, while a large number of neighbors produces a smoother decision boundary, meaning lower variance but higher bias.
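To see this trade-off in practice, one can simply score the classifier over a range of K values. This sketch uses scikit-learn’s bundled copy of the Iris data (the same dataset used later in this article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score the classifier for a range of K values to compare their accuracy
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 3))
```

Plotting the resulting accuracies against K is a common way to pick a value that balances noise sensitivity against an overly smooth decision boundary.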
When and Why use KNN?
KNN is best applied to datasets when they are labelled, noise-free, and relatively small. Given the classifications of data points in a training set, the algorithm can classify future unknown data based on this information. In addition, data sets with excess features that don’t contribute to the classification of the data points may cause the algorithm to miss patterns in the data. These noise in the data set can include extraneous data points that don’t relate with the rest of the dataset and features that don’t help in identifying the classification. Because the KNN algorithm is instance based, meaning no explicit training step is required, the training stage is relatively fast as compared to other methods. Therefore, with datasets with homogeneous features and few outliers and missing values, the KNN classifier can prove to be an accurate classifier.
The Math behind KNN
After transforming the data points from a dataset into their mathematical components, the KNN algorithm calculates the distance between different points to make correlations. A common method to find this distance is to use the Euclidean distance between two points.
In a two-dimensional plane, the distance between two points can be calculated as below:
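For two points p = (p₁, p₂) and q = (q₁, q₂), the Euclidean distance is:

```latex
d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}
```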
This principle can be applied to a ‘multidimensional’ space where each dimension represents a different feature of the data points; in three dimensions, for example, the Euclidean distance is calculated across all three coordinates.
When the space consists of n-dimensions, the following can be used to calculate the Euclidean distance:
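For p = (p₁, …, pₙ) and q = (q₁, …, qₙ), the formula generalizes to:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
```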
There is a limit to the number of dimensions for which these Euclidean distance calculations yield the best classification results. As the number of dimensions increases, the results may become diluted and less useful, so care must be taken to test datasets to check their validity.
Computing Euclidean Distance in Python
#import the library
from scipy.spatial import distance

#initialize points
p1 = (10, 15, 20)
p2 = (25, 30, 35)

#calculate euclidean distance
euclidean_distance = distance.euclidean(p1, p2)
print('The Euclidean distance b/w', p1, 'and', p2, 'is:', euclidean_distance)
Simple Example using K-nearest neighbors (KNN) — Iris Data
The following example will utilize data from the Iris Flower Dataset, often known as Fisher’s Iris dataset, which I accessed from the UCI Machine Learning Repository. This multivariable dataset contains 50 samples of each of three different species of Iris: Iris setosa, Iris virginica, and Iris versicolor. Each sample includes four features measured in centimeters: the length and the width of the sepals and petals. In all, the set contains 150 records with 5 attributes (4 measurements and the species classification).
I started by importing the libraries used throughout the program. scikit-learn is a popular machine learning library; its model_selection module provides the train_test_split function used below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
After loading and reading the data, we can preview the dataset.
#read csv file into a data frame
iris = pd.read_csv('iris.csv')

#display initial rows of data frame
iris.head()
iris_species = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
iris['species_num'] = iris.apply(lambda x: iris_species[x.species], axis=1) # axis=1 applies to each row
I enumerated the names of the species in the ‘species’ column to convert them into numerical values based on the dictionary above. The reverse mapping below, from label value back to species name, makes the results easier to interpret.
iris_number = dict(zip(iris.species_num.unique(), iris.species.unique()))
Here is the mapping: {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
X = iris[['sepal_length', 'sepal_width', 'petal_length']]
y = iris['species_num']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
In the segment of code above, the sepal length, sepal width, and petal length features of each flower are selected to train the model. The dataset is then split into a training subset and a test subset.
Below, using the training set separated previously, I plotted a 3D scatter plot to visualize the relationships between the features.
# plotting a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X_train['sepal_length'], X_train['sepal_width'], X_train['petal_length'], c = y_train, marker = 'o', s=100)
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.set_zlabel('petal_length')
plt.show()
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 5)
Here I created a classifier object and defined the K value, the number of neighbors to be considered.
knn.fit(X_train, y_train)
The classifier is then fitted to the training data.
Then, we can estimate the accuracy of the developed classifier with the test data.
knn.score(X_test, y_test)
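Since the walkthrough above depends on a local iris.csv file, the same pipeline can be reproduced end to end with scikit-learn’s bundled copy of the dataset. This sketch assumes, as above, that only the first three feature columns (sepal length, sepal width, petal length) are used:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's bundled Iris data stands in for the local iris.csv
iris = load_iris()
X = iris.data[:, :3]            # sepal length, sepal width, petal length
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(round(knn.score(X_test, y_test), 3))
```

The score is the fraction of test samples classified correctly, so a value near 1.0 indicates a well-performing classifier on this dataset.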
As noted above, the KNN algorithm can classify the data points of a dataset accurately and with relative ease. For reasonably sized datasets with few outliers, this method is reliable and can yield results in a relatively short time compared to other models.