The very first features that we should remove from our dataset are constant ones. Removing them by hand might seem easy, but suppose you have 200–300 features; in that case, using an automated technique makes sense.
First, let us import all the required libraries. Pandas is used to create and manipulate the dataset. VarianceThreshold is used to remove the features that have low variance; variance is simply a measure of variability, and a variance of 0 implies that all values are the same, i.e., constant. train_test_split is used for splitting the data for training and testing purposes.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
Here, we have created a dummy dataframe to illustrate this feature-selection technique easily. We can see that columns C and D hold constant values, so they should be removed.
df = pd.DataFrame({"A": [2, 3, 5, 2, 6],
                   "B": [4, 6, 4, 2, 8],
                   "C": [5, 5, 5, 5, 5],
                   "D": [0, 0, 0, 0, 0]})
df
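As a quick sanity check, we can also inspect the per-column variance directly with pandas; columns C and D should both come out as exactly 0. This step is only illustrative and not part of the selection pipeline itself.
# Variance of each column; C and D should both be 0.0
df.var()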
Now, we have initialized the object var of VarianceThreshold with the threshold set to 0 (constant). We can vary this according to our needs. Then, we have printed the array of booleans in which True and False represent non-constant and constant features respectively.
var = VarianceThreshold(threshold=0.0)
var.fit(df)
var.get_support()
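For this dummy dataframe, the call returns [True, True, False, False]: columns A and B are kept, while C and D are flagged as constant.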
We have used a list comprehension in which we loop through all the columns and insert the constant features into the list named constant_features. Finally, we have dropped the constant features from the dataset.
constant_features = [i for i in df.columns
                     if i not in df.columns[var.get_support()]]
print(constant_features)
df.drop(constant_features, axis=1)
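As a side note, the transformer can also do the dropping for us: transform() returns the data with the low-variance columns removed, though as a NumPy array rather than a dataframe. A minimal sketch (the name df_reduced is illustrative, not from the original code):
# transform() keeps only the non-constant columns, returning a NumPy array;
# we wrap it back into a dataframe using the retained column names.
df_reduced = pd.DataFrame(var.transform(df),
                          columns=df.columns[var.get_support()])
df_reduced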
The previous example may seem trivial, since the constant columns can be spotted by eye. But in real-world scenarios there can be hundreds or thousands of features, and this technique helps in exactly those cases. Let us see the same approach on a bigger, real dataset.
Here, we have loaded our dataset, which is available on GitHub. The dataset consists of 371 features.
df = pd.read_csv("train.csv")
print(df.shape)
Here, X represents the input variables and Y represents the target variable. We should always fit any feature-selection technique on the training set only, to avoid any kind of overfitting; that is why we split the dataset into training and testing sets first. After that, the process is the same as explained above. Here, 371 - 332 = 39 features are constant, so we drop these features from both the training and testing sets.
X = df.drop(labels=['TARGET'], axis=1)
Y = df['TARGET']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
var = VarianceThreshold(threshold=0.0)
var.fit(x_train)
sum(var.get_support())
constant_features = [i for i in x_train.columns
                     if i not in x_train.columns[var.get_support()]]
# Drop the constant features from both the training and the testing set
x_train = x_train.drop(constant_features, axis=1)
x_test = x_test.drop(constant_features, axis=1)
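As mentioned earlier, the threshold can be varied according to our needs. Raising it above 0 also removes quasi-constant features, i.e., columns whose values are almost, but not exactly, the same. Below is a minimal sketch assuming an illustrative cutoff of 0.01; this value is not from the original example and should be tuned per dataset.
# Features whose training-set variance is <= 0.01 are dropped (illustrative cutoff)
quasi_var = VarianceThreshold(threshold=0.01)
quasi_var.fit(x_train)
quasi_constant_features = [i for i in x_train.columns
                           if i not in x_train.columns[quasi_var.get_support()]]
x_train = x_train.drop(quasi_constant_features, axis=1)
x_test = x_test.drop(quasi_constant_features, axis=1)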