
This is the case in which you have a column with categorical values that are easy to discretize, for example a small, fixed set of labels.
In this case, using the Python programming language, the task is straightforward.
import pandas as pd

data = {'row_2': ['a', 'b', 'c', 'd', 'a', 'b']}
df = pd.DataFrame.from_dict(data)
df2 = pd.get_dummies(df)
print(df2)
>>>
   row_2_a  row_2_b  row_2_c  row_2_d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0
5        0        1        0        0
In this case, we will have no trouble using the result in a machine learning model.
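For instance, a minimal sketch (assuming scikit-learn is available; the estimator choice here is purely illustrative) shows that the dummy matrix can be passed to a model directly:

import pandas as pd
from sklearn.cluster import KMeans

data = {'row_2': ['a', 'b', 'c', 'd', 'a', 'b']}
df2 = pd.get_dummies(pd.DataFrame.from_dict(data))

# any estimator that expects a numeric matrix accepts the dummy columns as-is
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df2)
print(model.labels_)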
The second case from the figure above does not necessarily have to be applied; I just wanted to present another alternative. The problem with its use is that, when applying a machine learning model, it can add unnecessary weight to the parameters.
In this case the number of columns can grow without bound, which can make our model perform poorly during training. Its biggest problem is that the dataframe can end up with more columns than rows.
import pandas as pd

data = {'row_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                  'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']}
df = pd.DataFrame.from_dict(data)
df2 = pd.get_dummies(df)
print(df2)
The example above is already hard to work with, and real data can be much worse.
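To make the risk concrete, a quick check of the resulting shape (a minimal sketch reusing the same toy data) shows that the dummy matrix is already as wide as it is tall:

import pandas as pd

data = {'row_2': list('abcdefghijklmnopqrstuvwxyz')}
df2 = pd.get_dummies(pd.DataFrame.from_dict(data))

print(df2.shape)                     # (26, 26): one dummy column per row
print(df2.shape[1] >= df2.shape[0])  # True: a warning sign for most models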
This case is an addendum to case 2, because we can still end up with a very large number of columns, and even a single column can carry more information than necessary. The only extra step here is splitting the column, after which we are back in case 2.
import pandas as pd
data = {'row_2': ['a,b', 'b,e', 'c', 'd', 'a', 'b']}
df = pd.DataFrame.from_dict(data)
print(df['row_2'].str.get_dummies(sep=','))
>>>
   a  b  c  d  e
0  1  1  0  0  0
1  0  1  0  0  1
2  0  0  1  0  0
3  0  0  0  1  0
4  1  0  0  0  0
5  0  1  0  0  0
This pythonic solution is very elegant.
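As a small follow-up sketch (assuming you want to keep the original column next to its expansion), the dummies can simply be joined back onto the dataframe:

import pandas as pd

data = {'row_2': ['a,b', 'b,e', 'c', 'd', 'a', 'b']}
df = pd.DataFrame.from_dict(data)
dummies = df['row_2'].str.get_dummies(sep=',')

# join the 0/1 columns back onto the original frame
df_full = df.join(dummies)
print(df_full)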
This problem appears, for example, with age.
import pandas as pd
data = {'age': [10, 12, 11, 20, 30, 33, 31, 60, 2, 40, 70]}
df = pd.DataFrame.from_dict(data)
You have ages that are very close and do not differ much, so the interesting approach is to group them into intervals.
df['age_interval'] = pd.cut(df.age, [0, 10, 18, 30, 50, 70], include_lowest=True)
print(df.head())
>>>
   age    age_interval
0   10  (-0.001, 10.0]
1   12    (10.0, 18.0]
2   11    (10.0, 18.0]
3   20    (18.0, 30.0]
4   30    (18.0, 30.0]

bins = [0, 10, 18, 25, 35]
labels = ["children","teen","adult","old"]
df['age_label'] = pd.cut(df['age'], bins=bins, labels=labels)
print(df[['age', 'age_label']].head())
>>>
   age age_label
0   10  children
1   12      teen
2   11      teen
3   20     adult
4   30       old
This interval approach can be used for other problems as well, and it will help you avoid carrying too many variables.
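If a single numeric column is preferred over dummies, one possible next step (a sketch, assuming the labelled column from the example above) is to use the integer codes of the resulting categories:

# the labelled intervals are a pandas Categorical, so they already carry integer codes
# children=0, teen=1, adult=2, old=3; ages outside the bins become NaN and get code -1
df['age_code'] = df['age_label'].cat.codes
print(df[['age', 'age_label', 'age_code']].head())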
When faced with these cases, the best solution is to use pandas dummy variables (exemplified above). But we must be careful when the number of columns becomes excessive in relation to the number of rows; in that case, the recommended approach is to check which categorical columns are really important.
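A minimal sketch of that check (the column names and threshold below are illustrative, not taken from the examples above) could compare each column's cardinality with the number of rows before calling get_dummies:

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'red'],  # few categories: safe to expand
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5', 'u6'],          # one category per row: not useful as dummies
})

max_categories = 4  # assumed threshold, tune it for your data
safe_cols = [c for c in df.columns if df[c].nunique() <= max_categories]
print(pd.get_dummies(df[safe_cols]))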
By making categorical variables numerical, their use in a machine learning model becomes much more effective.