An outlier is any data point that lies far from the mean of its corresponding random variable. Generally speaking, such points degrade results and offer little of value to the training procedure, so certain measures should be taken to limit their impact:
If there are only a few of them, we can simply remove them.
Otherwise, the engineer should choose cost functions that are robust to outliers.
For example, the least-squares error is not robust: squaring an outlier's already large residual makes its error even larger, so outliers end up dominating the cost function, as the sketch below illustrates.
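To make this concrete, here is a minimal sketch in Python comparing the squared-error cost with the Huber cost, a standard robust alternative; the residual values are made up for illustration.

```python
import numpy as np

def squared_error(residuals):
    """Least-squares cost: residuals are squared, so outliers dominate."""
    return np.mean(residuals ** 2)

def huber_error(residuals, delta=1.0):
    """Huber cost: quadratic near zero, linear beyond delta, so a
    large residual contributes only linearly to the total cost."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.mean(np.where(abs_r <= delta, quadratic, linear))

# Residuals of a hypothetical model: four small errors and one outlier.
residuals = np.array([0.1, -0.2, 0.15, 0.05, 10.0])

print(squared_error(residuals))  # ~20.0 -- the single outlier dominates
print(huber_error(residuals))    # ~1.9  -- the outlier's influence is bounded
```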
Variability in a dataset also means variability in feature scales: one or more features may span very different ranges, and, as stated previously, this quickly translates into the larger-valued features dominating the cost function over those with smaller ranges. The natural remedy is therefore to rescale all features to a common reference. A standard choice is standardization, a transformation after which each feature has mean 0 and standard deviation 1.
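A minimal standardization sketch with NumPy follows; the feature matrix and its units are invented for illustration.

```python
import numpy as np

# Toy feature matrix: column 0 on a scale of ~1, column 1 on a scale of
# thousands, so column 1 would dominate any distance- or gradient-based cost.
X = np.array([[1.2, 5300.0],
              [0.8, 4100.0],
              [1.5, 6800.0],
              [1.1, 5000.0]])

# Standardize each feature (column): subtract its mean, divide by its std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```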
It is worth noting that this procedure is one of several linear ones, just like scaling to [-1, 1]. Alternatively, one can use non-linear scaling functions such as softmax (logistic) scaling, sketched below.
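Here is a sketch of both options: a linear min-max rescaling into [-1, 1] and a non-linear softmax scaling, understood here as the usual logistic squashing of standardized values. The sample vector and the helper names are invented for illustration.

```python
import numpy as np

def minmax_scale(x, low=-1.0, high=1.0):
    """Linear rescaling of x into [low, high]."""
    return low + (high - low) * (x - x.min()) / (x.max() - x.min())

def softmax_scale(x):
    """Non-linear (logistic) scaling: values near the mean map almost
    linearly into (0, 1), while extreme values are squashed smoothly."""
    z = (x - x.mean()) / x.std()
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # one very large value
print(minmax_scale(x))   # the large value pushes the rest close to -1
print(softmax_scale(x))  # the large value is squashed smoothly toward 1
```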
Since we are talking about real-world problems, it is normal to expect a percentage of the data to be missing, a phenomenon commonly observed in social-science or prognostic medical datasets. So, what do we do? Essentially, the solution, known as imputation, can be divided into three candidates:
Replace missing values with zeroes,
replace them with a conditional mean, E[missing | observed], or
replace them with the unconditional mean, calculated from the available observed values.
Of course, a simple alternative is to discard the incomplete samples altogether, but this can cause problems when the dataset is not large enough to afford such a drastic measure, since it reduces the amount of information that can be extracted. A sketch of these options follows.
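Below is a minimal sketch of zero and unconditional-mean imputation, plus row dropping, using NumPy. The impute helper and the toy matrix are invented for illustration; a conditional mean E[missing | observed] would additionally require a model of the joint feature distribution, which is beyond this sketch.

```python
import numpy as np

def impute(X, strategy="mean"):
    """Fill NaNs column by column. 'zero' inserts 0; 'mean' inserts the
    unconditional mean computed from the observed entries of that column."""
    X = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if strategy == "zero":
            X[missing, j] = 0.0
        elif strategy == "mean":
            X[missing, j] = np.nanmean(X[:, j])  # mean over observed values
    return X

X = np.array([[1.0, np.nan],
              [2.0, 6.0],
              [np.nan, 8.0]])

print(impute(X, "zero"))
print(impute(X, "mean"))  # column means from observed values: 1.5 and 7.0

# The drastic alternative: drop every row containing a missing value.
X_dropped = X[~np.isnan(X).any(axis=1)]  # keeps only fully observed rows
```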