While creating a data science project, we all follow the data science life cycle, which consists of several phases:
· Data gathering
· Feature engineering
· Feature selection
· Model creation
· Tuning of model
· Evaluation metrics
For a project to qualify as a data science project, it must include all of the phases above. But here we are going to discuss one phase that is both important and time-consuming: feature engineering.
In my experience, feature engineering takes approximately 30–40 percent of the total time spent on a project.
Let’s talk about feature engineering.
Feature engineering simply means extracting, manipulating, and analyzing our dataset to find the relevant features that help our model give better results.
In the real world, data is not as clean as the datasets we use for learning: it contains lots of missing values, the units and magnitudes of columns differ, there may be outliers, features may not follow a normal distribution, and the dataset may be imbalanced. Handling all of these issues comes under feature engineering.
Steps involved in feature engineering
Handling Missing Values
The first step consists of removing null values, or replacing them with relevant values in a way that increases the feature's usefulness.
Ways to deal with null values:
· Drop rows or columns with null values; it is the easiest way, but it discards data, so don't reach for it too soon.
· Replace them with the mean/median/mode if the data is numeric in nature.
· Replace them with the most frequent value if the data is categorical.
· Use imputation techniques to fill the null values.
· Sometimes backward and forward fill can help, especially with time-ordered data.
· Train a classifier to predict the missing values of a categorical variable.
There are other ways to deal with null values as well; the above are some common techniques. If you know of others, please share them in the responses.
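The first three techniques above can be sketched with pandas. This is a minimal example on a made-up toy dataset; the column names and values are purely illustrative.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 34, 29, np.nan],        # numeric
    "salary": [50000, 62000, np.nan, 58000, 61000],  # numeric
    "city": ["Delhi", "Mumbai", None, "Delhi", "Mumbai"],  # categorical
})

# Numeric columns: replace nulls with the median (robust to outliers) or mean
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Categorical column: replace nulls with the most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternative for time-ordered data: forward fill, then backward fill
# df["salary"] = df["salary"].ffill().bfill()

print(df.isnull().sum().sum())  # 0 — no nulls remain
```

For model-based imputation, scikit-learn's `KNNImputer` and `IterativeImputer` follow the same fit/transform pattern as the scalers discussed later.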
Fighting with Outliers
An outlier is a data point that is distant from all other observations, lying outside the overall distribution of the dataset.
Sometimes outliers are important to keep, depending on the problem statement. In the stock market, for example, the price of a company's stock may rise sharply and become an outlier, but we cannot remove it, because stock price prediction depends on the trend of the last n days.
But we should know how to deal with outliers. Again, there are multiple ways:
· By using the z-score: flag points that lie more than a chosen number of standard deviations from the mean.
· By capping values at the 1.5 × IQR fences, or at the 3 × IQR fences in the case of extreme outliers.
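Both approaches can be sketched on a tiny hypothetical series. Note that the z-score cutoff is usually 3 for large samples; with only seven points the maximum possible z-score is small, so a cutoff of 2 is used here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one obvious outlier
prices = pd.Series([12, 14, 13, 15, 14, 13, 120])

# Z-score method: flag points far from the mean (cutoff 2 for this tiny sample)
z = (prices - prices.mean()) / prices.std()
outliers_z = prices[np.abs(z) > 2]

# IQR method: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = prices.clip(lower, upper)  # 120 is pulled down to the upper fence
```

Capping (sometimes called winsorizing) keeps the row in the dataset, which matters when, as in the stock price example, the extreme value still carries signal.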
Encoding Categorical Features
Machine learning algorithms work on mathematical formulas, so when we pass in our dataset it undergoes a lot of calculations; for this, our data needs to be integer or floating point in nature.
But sometimes our dataset contains categorical data that needs to be converted to numbers. We cannot convert it arbitrarily; instead, we should convert it in a way that preserves the information it carries about the dataset.
Categorical data is of two types: ordinal data and nominal data.
Nominal data simply names something without assigning it to an order in relation to other numbered objects or pieces of data. An example of nominal data might be a “pass” or “fail” classification for each student’s test result.
Ordinal data, unlike nominal data, involves some order; ordinal numbers stand in relation to each other in a ranked fashion. For example, suppose you receive a survey from your favorite restaurant that asks you to provide feedback on the service you received. You can rank the quality of service as “1” for poor, “2” for below average, “3” for average, “4” for very good.
So, for both we have different ways of doing encoding.
Ordinal Encoding can be done using
· Label Encoder
· Target-guided ordinal encoding
Nominal Encoding can be done using
· One Hot Encoding
· One Hot Encoding with multiple categories
· Mean Encoding
· Replace multiple categories with their count frequency
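A few of the techniques above can be sketched with pandas on a toy dataset (the columns and category labels are hypothetical, reusing the restaurant-survey scale from earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "service": ["poor", "average", "very good", "average"],  # ordinal
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],         # nominal
})

# Ordinal encoding: map categories to integers that preserve the ranking
order = {"poor": 1, "below average": 2, "average": 3, "very good": 4}
df["service_enc"] = df["service"].map(order)

# Count/frequency encoding: replace each nominal category with its count
df["city_freq"] = df["city"].map(df["city"].value_counts())

# One-hot encoding: one binary column per nominal category
df = pd.get_dummies(df, columns=["city"])
```

The explicit mapping for ordinal data matters: a plain `LabelEncoder` assigns integers alphabetically and would put "average" above "very good", losing the ranking.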
Feature Scaling
Most datasets contain 10+ features, and each feature has its own unit (kg/cm/$/Celsius) and magnitude (value).
In regression-type problems, gradient descent must minimize the loss until it reaches the global minimum; if the dataset contains features with different units and magnitudes, training time may increase and accuracy may suffer.
In the case of KNN, if we do not scale, the features with the largest magnitudes dominate the distance between two data points and distort the results.
So, to overcome this, we scale all our features down to one common scale.
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling and can be done using MinMaxScaler.
Here's the formula for normalization: X' = (X - X_min) / (X_max - X_min)
Standardization is another scaling technique, where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation. It can be done using StandardScaler().
Here's the formula for standardization: X' = (X - μ) / σ, where μ is the mean and σ is the standard deviation.
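Both scalers are available in scikit-learn and follow the same fit/transform pattern. A minimal sketch on a hypothetical two-feature matrix (height in metres, salary in dollars) with very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different units and magnitudes
X = np.array([[1.70, 65000.0],
              [1.60, 48000.0],
              [1.85, 90000.0]])

# Normalization: rescale each column to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column ends up with mean 0 and unit standard deviation
X_std = StandardScaler().fit_transform(X)
```

In practice the scaler is fit on the training set only and then applied to the test set with `transform`, so no information leaks from the test data.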
Handling Imbalanced Datasets
Suppose we have a dataset whose dependent variable is categorical, such as 0/1 or Yes/No.
If, say, 70% of the inputs have their output as 1 and the rest as 0, we have an imbalanced dataset. A model trained on it may become biased towards 1 and predict 1 for most of the test data.
So, to avoid this bias, there are different ways to handle an imbalanced dataset:
· Oversampling: Randomly duplicate examples in the minority class
· Undersampling: Randomly delete examples in the majority class
· Using ensemble techniques
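Random oversampling, the first option above, can be sketched with scikit-learn's `resample` utility on a toy dataset that mirrors the 70/30 split described earlier:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 70% class 1, 30% class 0
df = pd.DataFrame({
    "feature": range(10),
    "target":  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

majority = df[df["target"] == 1]
minority = df[df["target"] == 0]

# Oversampling: randomly duplicate minority examples until classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["target"].value_counts())  # 7 examples of each class
```

Undersampling is the mirror image: `resample` the majority class down to `len(minority)` with `replace=False`. For smarter synthetic oversampling, the imbalanced-learn library's SMOTE is a common choice.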
Finally, these are the steps involved in feature engineering. If I have left any out, please share them in the responses.