Classification is one of the most fundamental concepts in data science: a machine learning method in which a predictive model assigns a class label to an input. Classification algorithms learn to assign data to preset categories by analyzing sets of training data.

Proper data classification allows one to apply appropriate controls based on each predetermined category. Classifying data can save time and resources, because one can focus on what’s important rather than putting unnecessary controls in place.

There are four main types of classification in machine learning:

- Binary classification: two class labels; provides a yes-or-no answer — e.g., identifying spam email
- Multiclass classification: more than two class labels — e.g., classifying faces or plant species
- Multilabel classification: two or more class labels, where one or more labels may be predicted for each example — e.g., photo tagging
- Imbalanced classification: class labels are unequally distributed — e.g., fraud detection or medical diagnosis

Several commonly used classification algorithms include:

- K-Nearest Neighbor
- Logistic Regression
- Artificial Neural Networks/Deep Learning
- Support Vector Machines

I’ll be discussing one of the most fundamental and well-known machine learning algorithms used in classification: the **K-nearest neighbors algorithm (KNN)**.

**K-nearest neighbors classifier**

KNN classifies new data points based on their similarity to previously stored data points. The algorithm finds the distances between a query and all the examples in the data, selects the specified number of examples (K) closest to the query, then votes for the most frequent label (for classification) or averages the labels (for regression).
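The three steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy data are my own:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. distances from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. indices of the K closest points
    nearest = np.argsort(distances)[:k]
    # 3. majority vote among the K neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy training data: two well-separated clusters with labels 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.1, 1.0])))  # -> 0
print(knn_predict(X, y, np.array([5.1, 5.0])))  # -> 1
```

Note that nothing is "trained" here: the model is simply the stored data, which is why KNN is called an instance-based (or lazy) learner.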

This algorithm is widely used in industry due to its low computation time and its non-parametric nature (meaning it makes no assumptions about the data). A common application of KNN can be seen in *Concept Searches*: software packages that help companies locate similar documents in an email inbox, for example. Another instance is *Recommender Systems*, where an algorithm can recommend products, media, or advertisements to an individual based on their previous purchases or engagements.

While the KNN algorithm can be relatively easy to use and train, the accuracy of the KNN classifier will depend on the quality of the data and the specific K value chosen.

When implementing a KNN classifier, data scientists must also decide how many neighbors to consider: the choice of K shapes the prediction model, and different datasets have different requirements. According to datacamp.com, with a small number of neighbors, noise has a higher influence on the result, while a large number of neighbors makes prediction computationally expensive. Research has also shown that a small number of neighbors gives the most flexible fit, with low bias but high variance, whereas a large number of neighbors produces a smoother decision boundary, meaning lower variance but higher bias.
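The overfitting end of this trade-off can be demonstrated by scoring a scratch KNN on its own training data for several K values. A hedged sketch (the synthetic data and `knn_predict` helper are illustrative, not from the article's dataset):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    # distance to every stored point, then majority vote among the k closest
    d = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
# two overlapping Gaussian clusters, so the class boundary is genuinely noisy
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# score each K on the training data itself
accs = {}
for k in (1, 15, 45):
    accs[k] = float(np.mean([knn_predict(X, y, x, k) == t for x, t in zip(X, y)]))
    print(f"K={k}: training accuracy {accs[k]:.2f}")
# K=1 always scores 1.00 on training data: each point is its own nearest
# neighbor (memorization / high variance); larger K smooths the boundary
# and typically lowers the training score.
```

In practice the best K is chosen by evaluating on held-out data, not on the training set as done here for illustration.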

## When and Why use KNN?

KNN is best applied to datasets that are labeled, noise-free, and relatively small. Given the classifications of data points in a training set, the algorithm can classify future unknown data based on this information. Datasets with excess features that don’t contribute to classification may cause the algorithm to miss patterns in the data. This noise can include extraneous data points that don’t relate to the rest of the dataset, as well as features that don’t help identify the classification. Because the KNN algorithm is instance-based (no explicit training step is required), the training stage is relatively fast compared to other methods. Therefore, on datasets with homogeneous features and few outliers and missing values, the KNN classifier can prove to be an accurate classifier.

**The Math behind KNN**

After transforming the data points from a dataset into their mathematical components, the KNN algorithm calculates the distance between different points to make correlations. A common method to find this distance is to use the **Euclidean distance** between two points.

In a two-dimensional space, the distance between points p = (p₁, p₂) and q = (q₁, q₂) can be calculated as:

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²)

This principle can be applied to a ‘multidimensional’ space where each dimension represents a different feature of the data points. In three dimensions, for example, a third squared difference (q₃ − p₃)² is simply added under the square root.

When the space consists of n dimensions, the following can be used to calculate the Euclidean distance:

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)²)

There is a limit to the number of dimensions for which these Euclidean distance calculations yield good classification results. As the number of dimensions increases, the distances may become diluted and less useful, so care must be taken to test datasets to check their validity.

**Computing Euclidean Distance in Python**

```python
# import the library
from scipy.spatial import distance

# initialize points
p1 = (10, 15, 20)
p2 = (25, 30, 35)

# calculate euclidean distance
euclidean_distance = distance.euclidean(p1, p2)
print('The Euclidean distance b/w', p1, 'and', p2, 'is:', euclidean_distance)
```
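The same result can be reproduced with plain NumPy, which makes the n-dimensional formula explicit. A quick sketch; `np.linalg.norm` computes the same quantity as the hand-written sum:

```python
import numpy as np

p1 = np.array([10, 15, 20])
p2 = np.array([25, 30, 35])

# square root of the sum of squared coordinate differences (the n-dimensional formula)
manual = np.sqrt(np.sum((p1 - p2) ** 2))
# NumPy's built-in vector norm gives the same value
builtin = np.linalg.norm(p1 - p2)

print(manual)   # ≈ 25.98
print(builtin)
```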

**Simple Example using K-nearest neighbors (KNN) — Iris Data**

The following example will utilize data from the Iris Flower Dataset, often known as Fisher’s Iris dataset, which I accessed from the UCI Machine Learning Repository. This multivariate dataset contains measurements of 50 samples of each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Each sample includes four features measured in centimeters: the length and the width of the sepals and petals. In all, this set contains 150 records with 5 columns (4 measurements plus the classification/species).

I started by importing the libraries used throughout the program. The `train_test_split` function comes from `sklearn.model_selection`, part of scikit-learn, a popular machine learning library that I will use for classification.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
```

After loading and reading the data, we can preview the dataset.

```python
# read csv file into a data frame
iris = pd.read_csv('iris.csv')

# display initial rows of data frame
iris.head()
```

```python
iris_species = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
iris['species_num'] = iris.apply(lambda x: iris_species[x.species], axis=1)  # axis=1 applies to each row
```

I enumerated the species names in the ‘species’ column, converting each name to a numerical value using the dictionary above. This will also be used to create a reverse mapping from label value back to species name, to make the results easier to interpret.

```python
iris_number = dict(zip(iris.species_num.unique(), iris.species.unique()))
```

Here is the mapping: