By Zachary Galante — Senior Data Science Student at Bryant University
This article is a tutorial on how to use K-Nearest Neighbors (KNN) to predict whether a tumor is malignant (cancerous) or benign (non-cancerous).
What is KNN?
KNN is a very basic Machine Learning algorithm that uses surrounding data to make predictions on new data. In the image below, the question mark represents a new data point (the test case) for the algorithm to classify. KNN then takes into account the classes of that point's nearest neighbors, and how far away they are, to make a prediction for the test case.
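The idea described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used later in the article: the function name `knn_predict`, the toy points, and the choice of k=3 are all made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labeled "B" and "M".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["B", "B", "B", "M", "M", "M"]

prediction = knn_predict(points, labels, query=(6.2, 6.4), k=3)
print(prediction)  # M
```

The query point sits inside the "M" cluster, so all three of its nearest neighbors vote "M".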
With KNN, distance measures are extremely important, because they determine how close each neighbor is, which in turn affects the classification. In this example Euclidean distance is used, as it is one of the most popular distance measures in Machine Learning. For two points (x1, y1) and (x2, y2), the Euclidean distance is d = sqrt((x2 - x1)^2 + (y2 - y1)^2): it takes the differences between the x and y coordinates, squares them to eliminate negative values, sums the squares, and takes the square root. This is a basic yet dependable distance measure. Other distance measures include Manhattan and Jaccard distance. Jaccard distance is closely related to the Intersection over Union (IoU) score, which is widely used in the Deep Learning community to evaluate image models such as Convolutional Neural Networks on object detection and segmentation tasks.
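To make the distance measures concrete, here is a small sketch of Euclidean and Manhattan distance in plain Python. The function names are illustrative and are not taken from the article's notebook.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```

For the same pair of points the two measures give different values, which is why the choice of distance can change which neighbors count as "nearest".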
Now that we have an understanding of how KNN works, we can implement a model in Python. Below, the data set is loaded and its features are shown. Note that the target variable is the “diagnosis_result” field.
The following graph explores the target variable in more depth, showing the count of the malignant and benign records.
Before the data can be passed into the model, a little more preprocessing is needed for the model to run correctly. With KNN all features must be numerical, because the algorithm cannot calculate the distance between strings. The screenshot below shows the datatypes of every column in the dataset. Notice that the only column that is not numeric is our target variable, “diagnosis_result”.
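A check like the one described above might look like the following sketch. The rows and the feature names `radius`, `texture`, and `area` are made-up stand-ins so that the example is self-contained; only the “diagnosis_result” column name comes from the article.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data set has more records and features.
df = pd.DataFrame({
    "radius": [23, 9, 21, 14, 9, 25],
    "texture": [12, 13, 27, 16, 19, 25],
    "area": [954, 1326, 1203, 386, 1297, 1040],
    "diagnosis_result": ["M", "B", "M", "M", "B", "M"],
})

# Datatype of every column.
print(df.dtypes)

# Columns that are not numeric and therefore need encoding before KNN.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Number of records per class in the target variable.
print(df["diagnosis_result"].value_counts())
```

In this stand-in frame, as in the article's data set, the target column is the only non-numeric one.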
To change this, the target variable is converted to binary with scikit-learn's LabelEncoder. The encoder sorts the class labels alphabetically and assigns them integers automatically, so “B” (benign) becomes 0 and “M” (malignant) becomes 1.
In the image below, the data is split into ‘x’ and ‘y’ data sets. In the first cell, everything except the target variable is selected as input data; this is the ‘x’ data set. Next, the diagnosis_result feature is selected as the ‘y’ data set. Then, as explained previously, the target variable is converted to binary and reassigned to ‘y’. Finally, the data is split into training and testing sets. Because no split size is specified, the default split of 75% training and 25% testing is used.
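The steps above can be sketched as follows, assuming scikit-learn's `LabelEncoder` and `train_test_split`. The DataFrame is a made-up stand-in with hypothetical feature names, and `random_state=0` is an assumption added only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data; only "diagnosis_result" is named in the article.
df = pd.DataFrame({
    "radius": [23, 9, 21, 14, 9, 25, 16, 15],
    "texture": [12, 13, 27, 16, 19, 25, 26, 18],
    "diagnosis_result": ["M", "B", "M", "M", "B", "M", "B", "B"],
})

# Everything except the target is input data; the target is selected on its own.
x = df.drop(columns=["diagnosis_result"])
y = df["diagnosis_result"]

# Encode the string labels as integers (classes are sorted, so B -> 0, M -> 1).
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(list(encoder.classes_))

# With no test_size given, 25% of the rows are held out for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
print(len(x_train), len(x_test))  # 6 2
```

`encoder.inverse_transform` can later map predicted 0s and 1s back to the original “B” and “M” labels.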
Running The Model
Now that the data is in the correct format and has been split into training and testing sets, the model can be trained and tested. A loop is used to try different numbers of neighbors and find the value that gives the model its best scores.
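A loop of this kind could be written as in the sketch below. Since the article's data set is not included here, scikit-learn's bundled breast cancer data set is used as a stand-in, and the range of 1 to 10 neighbors and `random_state=0` are assumptions, so the scores will differ from those in the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: a tumor data set that ships with scikit-learn.
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

neighbor_counts = range(1, 11)
train_scores, test_scores = [], []

for k in neighbor_counts:
    # The default metric (Minkowski with p=2) is Euclidean distance.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    # score() returns accuracy: the fraction of correct predictions.
    train_scores.append(model.score(x_train, y_train))
    test_scores.append(model.score(x_test, y_test))

for k, train_acc, test_acc in zip(neighbor_counts, train_scores, test_scores):
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Plotting `train_scores` and `test_scores` against `neighbor_counts` gives the kind of graph used to compare the candidate values.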
Evaluating The Model’s Performance
The graph shows that the best number of neighbors for the model is 2, because it gives the highest accuracy scores on both the testing and training sets. With 3 neighbors the model achieves the same testing score but a lower training score, so 2 neighbors is the best fit for the model.
Now that the model has been evaluated with different parameters, it can be run with the optimal number of neighbors, which was determined previously to be 2. As the results show, the model performed quite well, reaching an 84% testing score.
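Fitting the final model with 2 neighbors might look like this sketch. It again uses scikit-learn's bundled breast cancer data set as a stand-in and an assumed `random_state=0`, so the accuracy it prints will not match the 84% reported for the article's data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data, split 75% / 25% by default.
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Train with the number of neighbors chosen during evaluation.
model = KNeighborsClassifier(n_neighbors=2)
model.fit(x_train, y_train)

# Accuracy on the held-out testing set.
test_accuracy = model.score(x_test, y_test)
print(f"Testing accuracy: {test_accuracy:.2%}")

# Predict the class of a single unseen record.
first_prediction = model.predict(x_test[:1])[0]
print(first_prediction)
```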
As seen throughout the article, having more neighbors is not always beneficial for the model: accuracy improves up to a certain point and then declines. Finding that optimal number of neighbors is extremely important for achieving the best results from a model.