

Welcome to the second part of data preprocessing for machine learning. Do check out the first part for a better understanding. We will be covering the following topics in this blog.
- Normalization
- Binarization
- Encoding Categorical (Ordinal & Nominal) Features
- Imputation
- Polynomial Features
Normalization is the process of rescaling each sample to unit magnitude, which can improve the performance of models that are sensitive to feature scale. Every value in a row is divided by that row's magnitude (by default the Euclidean, or L2, norm), so each sample ends up on the unit sphere around the origin.
Let us take an example. We will generate some data and plot it.
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate data
df = pd.DataFrame({
    'x1' : np.random.randint(-100,100,10000).astype(float),
    'y1' : np.random.randint(-70,70,10000).astype(float),
    'z1' : np.random.randint(-150,150,10000).astype(float),
})

# Plot the raw data in 3D
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(df.x1, df.y1, df.z1)
Now we will import Normalizer from sklearn.preprocessing and fit it to our data. After the transformation the data will look like this.
from sklearn.preprocessing import Normalizer

# Scale each row (sample) to unit L2 norm
model = Normalizer()
data_tf = model.fit_transform(df)

# Rebuild the DataFrame and plot the normalized data
df = pd.DataFrame(data_tf, columns=['x1','y1','z1'])
ax = plt.axes(projection='3d')
ax.scatter3D(df.x1, df.y1, df.z1)
After the transformation every sample lies on the unit sphere around the origin, so each value falls in the range -1 to 1. Bringing features onto a common scale like this generally makes processing and training faster and more stable.
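As a quick sanity check, here is a minimal sketch (not from the original post, and reusing the data_tf array from the step above) confirming that every transformed row has unit L2 norm:
# Every normalized row should have an L2 norm of (approximately) 1
import numpy as np
norms = np.linalg.norm(data_tf, axis=1)
print(norms.min(), norms.max())   # both should print values very close to 1.0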
Binarization is the process of thresholding values to 0 or 1. This technique is useful for problems such as classification where we want exactly two categories. For example, if our data ranges from 1 to 10 and we set the threshold at 5, values from 1 to 5 become 0 and values above 5 become 1 (by default, sklearn's Binarizer thresholds at 0). Because the values are reduced to 0s and 1s, they are also cheap to process. Let us generate the data.
# Generate data
X = np.array([[1, -1, 2],
              [2, 0, 0],
              [0, 1, -1]])
Now we will apply the binarizer and look at the output.
from sklearn.preprocessing import Binarizer

# Threshold at 0 (the default): values > 0 become 1, values <= 0 become 0
binarizer = Binarizer()
data_tf = binarizer.fit_transform(X)
print(data_tf)
Output:
[[1 0 1]
 [1 0 0]
 [0 1 0]]
In the output, the data has been binarized into 0s and 1s.
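To reproduce the 1-to-10 example from the paragraph above, here is a minimal sketch (using a hypothetical scores array, not from the original post) with an explicit threshold of 5:
from sklearn.preprocessing import Binarizer
import numpy as np

# Hypothetical data in the range 1 to 10
scores = np.array([[1, 3, 5, 6, 8, 10]], dtype=float)

# Values <= 5 become 0, values > 5 become 1
binarizer = Binarizer(threshold=5)
print(binarizer.fit_transform(scores))   # [[0. 0. 0. 1. 1. 1.]]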
a) Encoding Ordinal Values
Ordinal values are categorical values that have an inherent order between them. Encoding them is often called label encoding. For example, a person's income level is an ordinal value: the categories Low, Medium, and High have a natural ranking. These values need to be converted into numeric categories such as 0, 1, 2 before a model can use them. Let us generate the data and start coding.
# Generate data
df = pd.DataFrame({
    'Age' : [33,44,22,44,55,22],
    'Income' : ['Low','Low','High','Medium','Medium','High']
})
df
Now those values will be converted into numeric categories using a simple dictionary mapping.
df.Income.map({'Low' : 1,'Medium': 2, 'High' : 3})
The values are now mapped to numeric categories: 1 for Low, 2 for Medium, and 3 for High.
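As an alternative to the manual dictionary mapping, here is a minimal sketch using sklearn's OrdinalEncoder with an explicitly ordered category list (an approach assumed here, not covered in the original post):
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'Income' : ['Low','Low','High','Medium','Medium','High']})

# Passing the categories in their natural order gives Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low','Medium','High']])
df['income_tf'] = encoder.fit_transform(df[['Income']])
print(df)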
b) Encoding Nominal Values
This is a similar encoding technique, but here the data has no inherent order between categories. Nominal data simply falls into distinct groups, like Gender: we have categories such as Male and Female, but there is no ranking between them.
Let's look at an example.
df = pd.DataFrame({
    'Age' : [33,44,22,44,55,22],
    'Gender' : ['Male','Female','Male','Female','Male','Male']
})
df.Gender.unique()
We have two columns in the DataFrame, and Gender contains two unique categories. Now we will encode them using the LabelEncoder available in sklearn.
from sklearn.preprocessing import LabelEncoder

# Assign an integer label to each unique category
le = LabelEncoder()
df['gender_tf'] = le.fit_transform(df.Gender)
df
We can see that the data is now encoded as 0 and 1: Male is encoded as 1 and Female as 0, since LabelEncoder assigns integers to categories in alphabetical order.
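Because labels 0 and 1 can accidentally suggest an order, nominal features are often one-hot encoded instead. Here is a minimal sketch using pandas' get_dummies (an alternative not shown in the original post):
import pandas as pd

df = pd.DataFrame({'Gender' : ['Male','Female','Male','Female','Male','Male']})

# One column per category, with a 1 marking each row's category
print(pd.get_dummies(df, columns=['Gender']))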
Missing values cannot be processed by most learning algorithms, so we need to either remove the affected data or fill it in with some value. Filling in missing values is called imputation. An imputer uses a chosen strategy to infer missing values from the existing data. It is usually better to remove a row only if all of its elements are null; otherwise imputation is preferable.
Let's generate some data and impute it.
df = pd.DataFrame({
    'A':[1,2,3,4,np.nan,7],
    'B':[3,4,1,np.nan,4,5]
})
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit_transform(df)
Here, we have selected the null values and replaced each one with the mean of its column. Mean imputation can introduce bias on small datasets, so it is better suited to larger ones.
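SimpleImputer supports strategies other than the mean as well. Here is a minimal sketch showing median and constant fills on the same data (the constant value 0 is just an illustrative choice):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A':[1,2,3,4,np.nan,7], 'B':[3,4,1,np.nan,4,5]})

# Replace each NaN with the median of its column
print(SimpleImputer(strategy='median').fit_transform(df))

# Replace each NaN with a fixed value of our choosing
print(SimpleImputer(strategy='constant', fill_value=0).fit_transform(df))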
Polynomial features are typically used with polynomial regression to let a model learn relationships of a higher degree. The transformation expands the data with higher-degree terms (powers and products of the original features) so the model can capture the required relationship. We will walk through an example.
Let’s generate the data.
df = pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,3,4,5,6]})
print(df)
Here, the data has a relationship, but only of a low degree. We can expand the data so that it also contains the higher-degree terms.
from sklearn.preprocessing import PolynomialFeatures

# Expand each row into all terms up to degree 2: 1, A, B, A^2, A*B, B^2
pol = PolynomialFeatures(degree=2)
pol.fit_transform(df)
The data has been expanded with all polynomial terms up to degree 2: for each row we get the columns 1, A, B, A², AB, and B².
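To see exactly which columns the transform produces, here is a minimal sketch (assuming scikit-learn 1.0+, whose PolynomialFeatures provides get_feature_names_out):
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,3,4,5,6]})

pol = PolynomialFeatures(degree=2)
data_tf = pol.fit_transform(df)

# Label each generated column: 1, A, B, A^2, A B, B^2
print(pd.DataFrame(data_tf, columns=pol.get_feature_names_out(['A','B'])))
That is all for today. Check out more blogs about Machine Learning.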