Classification is one of the major topics in machine learning. Some classification problems do not even have numeric features to analyze directly.
In this article, I will classify IBM employee attrition using a neural network built with TensorFlow.
About the dataset:
The dataset has 35 attributes and 1,470 rows, one per IBM employee. Eight of the attributes are non-numeric (categorical) variables, such as marital status, job role, and education field.
Some common problems in classification are:
- Email spam detection
- Speech recognition
- Gesture recognition
- Digit recognition
- And the list goes on.
Classification problems require labeled datasets, so solving one can involve collecting a large amount of data and labeling it.
For this problem, I imported the following Python libraries.
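The exact import list is not shown here, but a plausible set for this workflow (assuming pandas and NumPy for the data, TensorFlow/Keras for the model, and matplotlib for the plot below) is:

```python
# Plausible imports for this workflow; the article's exact list is not
# shown. pandas/NumPy handle the tabular data, TensorFlow/Keras builds
# the model, and matplotlib draws the attrition-ratio plot.
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
```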
To solve the employee attrition problem, I did not jump directly into the neural network. First, I binned the employees' monthly salaries and computed the ratio of attrition to non-attrition in each bin.
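The binning code is not shown; a minimal sketch of the idea, using a small synthetic frame in place of the IBM data (the column names MonthlyIncome and Attrition match the dataset, but the values and bin edges here are made up), might look like:

```python
import pandas as pd

# Synthetic stand-in for the IBM frame; MonthlyIncome and Attrition
# are real column names, but these values and bin edges are made up.
df = pd.DataFrame({
    "MonthlyIncome": [1500, 2200, 3100, 4800, 7600, 9900, 12500, 18000],
    "Attrition":     ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Bin the salaries, then take the fraction of "Yes" (attrition) per bin.
df["IncomeBand"] = pd.cut(df["MonthlyIncome"],
                          bins=[0, 5000, 10000, 20000],
                          labels=["low", "mid", "high"])
ratio = (df.assign(left=df["Attrition"].eq("Yes"))
           .groupby("IncomeBand", observed=True)["left"].mean())
print(ratio)  # the ratios that were plotted against the salary bins
```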
In the plot, we can see that attrition is much higher for low-income employees than for high-income ones.
Now, let's jump into the TensorFlow implementation. First, I map the attrition values Yes and No to 0 and 1 respectively, and split the dataset into two sets: a training set with 80% of the data and a test set with 20%.
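A sketch of that encoding and split on a tiny synthetic frame; DataFrame.sample with a fixed random_state is one common way to do an 80/20 split, though the article does not say exactly how it was done:

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset.
df = pd.DataFrame({
    "MonthlyIncome": range(1000, 1010),
    "Attrition": ["Yes", "No"] * 5,
})

# Encode the labels as in the article: Yes -> 0, No -> 1.
df["Attrition"] = df["Attrition"].map({"Yes": 0, "No": 1})

# One common way to do an 80/20 split with pandas.
train_df = df.sample(frac=0.8, random_state=42)  # 80% for training
test_df = df.drop(train_df.index)                # remaining 20% for testing
print(len(train_df), len(test_df))  # 8 2
```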
Then I remove the attrition column from the training data frame and build a Python dictionary whose keys are the column names and whose values are Keras Input objects.
Now, we concatenate the numeric inputs together and run them through a normalization layer.
Then we keep all_numeric_inputs in a list to concatenate later. The string values in each categorical column are also mapped to integers, which are used as indices. If anything is unclear, each of these functions can be looked up online for an in-depth understanding. The main goal here is to convert the strings into floating-point numbers so the analysis can be performed.
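The inputs dictionary, normalization, and string-to-float mapping described above can be sketched as follows, on a tiny synthetic frame (the real dataset has 35 columns; two numeric columns and one categorical column stand in for them here, with a StringLookup layer doing the string-to-index mapping):

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Tiny synthetic stand-in: two numeric columns and one categorical
# column in place of the dataset's 35 attributes.
train_df = pd.DataFrame({
    "Age": [25.0, 40.0, 31.0, 52.0],
    "MonthlyIncome": [2000.0, 8000.0, 4000.0, 12000.0],
    "MaritalStatus": ["Single", "Married", "Single", "Divorced"],
})

# One Keras Input per column, keyed by column name.
inputs = {}
for name, column in train_df.items():
    dtype = "string" if column.dtype == object else "float32"
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

# Concatenate the numeric inputs and run them through a Normalization
# layer adapted to the training data.
numeric_cols = [n for n in train_df.columns if train_df[n].dtype != object]
x = tf.keras.layers.Concatenate()([inputs[n] for n in numeric_cols])
norm = tf.keras.layers.Normalization()
norm.adapt(train_df[numeric_cols].to_numpy(dtype="float32"))
all_numeric_inputs = norm(x)

# Keep all_numeric_inputs in a list, then append each categorical
# column mapped through StringLookup (string -> index -> one-hot floats).
preprocessed = [all_numeric_inputs]
for name in train_df.columns:
    if train_df[name].dtype == object:
        lookup = tf.keras.layers.StringLookup(
            vocabulary=sorted(train_df[name].unique()),
            output_mode="one_hot")
        preprocessed.append(lookup(inputs[name]))

preprocessed_result = tf.keras.layers.Concatenate()(preprocessed)
train_preprocessing = tf.keras.Model(inputs, preprocessed_result)

# Feature dictionary: column names -> column values as arrays.
train_features_dict = {}
for name in train_df.columns:
    values = train_df[name].to_numpy()[:, None]
    if values.dtype != object:
        values = values.astype("float32")  # match the float32 Inputs
    train_features_dict[name] = values

features = train_preprocessing(train_features_dict)
print(features.dtype, features.shape)  # everything is now float32
```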
Running train_preprocessing (the last line of the code above) on the features returns arrays of the float data type.
Then, we can build a model on top of this, and the model can be fitted with the train_features_dict dictionary as x and train_labels as y.
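A sketch of building and fitting such a model; the layer sizes, optimizer, and loss below are illustrative assumptions that the article does not state, and two numeric inputs with random data stand in for the real features:

```python
import numpy as np
import tensorflow as tf

# Two hypothetical numeric inputs stand in for the full feature set.
inputs = {
    "Age": tf.keras.Input(shape=(1,), name="Age"),
    "MonthlyIncome": tf.keras.Input(shape=(1,), name="MonthlyIncome"),
}
x = tf.keras.layers.Concatenate()(list(inputs.values()))

# Small dense head with a sigmoid output for the binary attrition label;
# the sizes, optimizer, and loss are illustrative assumptions.
h = tf.keras.layers.Dense(16, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(h)
data_model = tf.keras.Model(inputs, out)
data_model.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])

# Random stand-in data shaped like the feature dictionary and labels.
rng = np.random.default_rng(0)
train_features_dict = {
    "Age": rng.uniform(20, 60, size=(32, 1)).astype("float32"),
    "MonthlyIncome": rng.uniform(1e3, 2e4, size=(32, 1)).astype("float32"),
}
train_labels = rng.integers(0, 2, size=(32, 1)).astype("float32")
history = data_model.fit(train_features_dict, train_labels,
                         epochs=2, verbose=0)
```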
This gave me 88 percent accuracy. I further compared each value of x = data_model.predict(test_features_dict) with test_labels and found that most of the inaccurate predictions were close to the correct value. So, there is room for improving this model to make better predictions.
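That comparison can be sketched like this, with synthetic probabilities standing in for the output of data_model.predict(test_features_dict):

```python
import numpy as np

# Stand-in for data_model.predict(test_features_dict): predicted
# attrition probabilities, plus the matching true labels.
predictions = np.array([0.10, 0.48, 0.85, 0.55, 0.95])
test_labels = np.array([0, 1, 1, 0, 1])

# Threshold at 0.5 and find the misclassified examples.
predicted_classes = (predictions >= 0.5).astype(int)
wrong = predicted_classes != test_labels

# Distance from the decision boundary for each mistake; a small
# distance means the model was almost right on that example.
margins = np.abs(predictions[wrong] - 0.5)
print(wrong.sum(), margins)
```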