KNN is one of the simplest and most common algorithms in practice. You have probably heard of or implemented this technique before. In any field, it is always better to understand techniques intuitively, so in this article I implement KNN from scratch using only the theory.
For folks who don’t know what KNN is, I will summarise it in a few words.
Suppose you have a 2D dataset with two numeric features and two class labels. If you plot a scatter plot of this data, you get something like this.
From the plot, you can easily separate the two classes. If I ask you about a new point Q plotted near the blue cluster, can you tell me which class it belongs to?
You will quickly say the blue class. We can say that just by looking at the position of the new data point. But can you tell how exactly you reached that conclusion? Yes, your answer is KNN!
What KNN basically does is calculate the K nearest neighbours of the given query data point (in this case Q). After finding the neighbours, Q is assigned to the class to which the majority of its neighbours belong. We can choose whichever value of K best suits the dataset; usually we pick an odd value so that a two-class vote cannot tie.
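The majority vote described above can be sketched in a few lines; the neighbour labels here are made up purely for illustration:

```python
from collections import Counter

# hypothetical labels of the K=5 nearest neighbours of Q
neighbour_labels = ["Blue", "Blue", "Red", "Blue", "Blue"]

# the predicted class is simply the most common label among the neighbours
majority = Counter(neighbour_labels).most_common(1)[0][0]
print(majority)  # Blue
```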
KNN is widely used in ML, but its performance depends heavily on the data we are working with. No specific model is created in KNN; the training dataset is the model! That is why KNN is called instance-based learning.
For implementing KNN from scratch we will do the following tasks:
- Import dummy data for testing purposes.
- Create class KNN and implement methods for the same.
- Finalizing results
Let us start implementing our Algorithm step by step.
1. Import data for testing purposes.
To check that our model works, we will use the famous Iris dataset: a classification of three flower species by the length and width of their petals and sepals. After importing the data, we will split it into training and testing sets in an 80–20 proportion.
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
2. Create class KNN and implement methods for the same.
First, we will create a class KNN that takes the value of K.
class KNN:
    def __init__(self, K=5):
        self.K = K
After creating the class KNN, we will add the two methods that are conventionally used in practice: .fit and .predict.
Fit- For KNN specifically, all the fit method does is store the training data and targets in the model. As I said earlier, KNN is an instance-based learning model, meaning it processes the data for a query point only at run time. So our fit method simply stores the data on the object.
def fit(self, X, y):
    self.XTrain = X
    self.yTrain = y
Predict- The crux of our algorithm lies in this function. What we want are the distances of all training points from our query point. We may receive a single query point or an iterable with multiple queries in it; we will handle this by iterating over the queries, computing all the distances for each, and storing them in a list.
def predict(self, X):
    predicted = []  # to store the final results
    for single_X in X:
        distances = [dist(single_X, tP) for tP in self.XTrain]
        minimum = np.argsort(distances)[:self.K]
In the above code, I have called a function named dist() to calculate the Euclidean distance; you will need to create this simple function, which takes two points as input and returns the distance between them. You can write it yourself or refer to my full code. (Note that np.argsort requires numpy, imported as np.) You might wonder what the last line above does. The argsort() function returns the indices that would sort the list passed to it, and the list slice keeps the first K of them. In short, this line returns the indices of the K smallest distances in our list of distances.
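Here is a minimal sketch of the dist() helper, assuming plain Euclidean distance between two equal-length points (the full code linked at the end may differ):

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length points
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

print(dist((0, 0), (3, 4)))  # 5.0
```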
Now, the last thing we have to do is find the majority class among those neighbours. We look up the class of each point by its index in yTrain and append the most frequent class label to our final output list. This most frequent class label is the prediction for the single query point we processed.
labels = [self.yTrain[i] for i in minimum]
predicted.append(Counter(labels).most_common(1)[0][0])
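A quick check of what Counter.most_common returns and why we index into it: most_common(1) gives a list of (label, count) pairs, so [0][0] extracts just the winning label.

```python
from collections import Counter

labels = [2, 0, 2]  # class labels of the K=3 nearest neighbours
print(Counter(labels).most_common(1))        # [(2, 2)]
print(Counter(labels).most_common(1)[0][0])  # 2
```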
Finally, we return the list of outputs:
return predicted
Bam!! We just completed coding our algorithm!
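Putting the pieces together, here is a standard-library-only sketch of the whole class; it substitutes math.dist (Python 3.8+) and sorted() for the hand-written dist() and np.argsort, and the tiny 2D dataset is made up for illustration:

```python
import math
from collections import Counter

class KNN:
    def __init__(self, K=5):
        self.K = K

    def fit(self, X, y):
        # instance-based learning: just store the training data
        self.XTrain = X
        self.yTrain = y

    def predict(self, X):
        predicted = []
        for single_X in X:
            distances = [math.dist(single_X, tP) for tP in self.XTrain]
            # indices of the K smallest distances (stand-in for np.argsort)
            minimum = sorted(range(len(distances)), key=distances.__getitem__)[:self.K]
            labels = [self.yTrain[i] for i in minimum]
            predicted.append(Counter(labels).most_common(1)[0][0])
        return predicted

# toy dataset: two well-separated 2D clusters
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["Blue", "Blue", "Blue", "Red", "Red", "Red"]

model = KNN(K=3)
model.fit(X_train, y_train)
print(model.predict([(0.5, 0.5), (5.5, 5.5)]))  # ['Blue', 'Red']
```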
3. Finalizing results
Now it is time to check how it works! Remember that we split the data into four parts; now it is time to use the test portion. We will fit the model on X_train and y_train, have it predict on X_test, and then check how many of its predictions match y_test.
KNNmodel = KNN(K=5)
KNNmodel.fit(X_train, y_train)
predictions = KNNmodel.predict(X_test)
print("Accuracy-", accuracy(y_test, predictions))
I have written a simple accuracy() function that returns the percentage of matching values in the two lists.
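The accuracy() helper is not shown in the article; a minimal version, assuming it compares the two sequences position by position and returns a percentage, might look like this:

```python
def accuracy(y_true, y_pred):
    # percentage of positions where the prediction matches the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100 * correct / len(y_true)

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 75.0
```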
On the first attempt (with K=5) I got 96.66% accuracy, but I later found that accuracy was 100% for K=3 on this split, so we will use K=3.
So that’s how we successfully implemented K-nearest neighbours. This was the simplest approach we could follow. We can still improve the algorithm, for example with more precise error handling. Next, I will try to implement variations of KNN from scratch. I have provided a GitHub link to the full code. Any suggestions or modifications from your side will be much appreciated.