Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data.
Outliers in input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models and ultimately poorer results.
Even before predictive models are prepared on training data, outliers can result in misleading representations and, in turn, misleading interpretations of the collected data. Outliers can skew summary statistics such as the mean and standard deviation, and distort plots such as histograms and scatterplots, compressing the body of the data.
Outliers are extreme values that fall a long way outside of the other observations. For example, in a normal distribution, outliers may be values on the tails of the distribution.
The process of identifying outliers goes by many names in data mining and machine learning, including outlier mining, outlier modelling, novelty detection and anomaly detection.
In his book Outlier Analysis, Aggarwal provides a useful taxonomy of outlier detection methods, as follows:
- Extreme Value Analysis: Determine the statistical tails of the underlying distribution of the data. For example, statistical methods like the z-scores on univariate data.
- Probabilistic and Statistical Models: Determine unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models optimized using expectation-maximization.
- Linear Models: Projection methods that model the data in lower dimensions using linear correlations, such as principal component analysis; data points with large residual errors may be outliers.
- Proximity-based Models: Data instances that are isolated from the mass of the data as determined by cluster, density or nearest neighbour analysis.
- Information Theoretic Models: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.
- High-Dimensional Outlier Detection: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (the curse of dimensionality).
There are many methods, and much research has been put into outlier detection. Let's work through a step-by-step process covering extreme value analysis, proximity methods and projection methods.
We will start with simple extreme value analysis.
- Focus on univariate methods;
- Visualize the data using scatterplots, histograms and box and whisker plots and look for extreme values;
- Assume a distribution (Gaussian) and look for values more than 2 or 3 standard deviations from the mean;
- Filter out outlier candidates from the training dataset and assess your model's performance.
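The steps above can be sketched in a few lines. This is an illustrative example on synthetic data (the Gaussian parameters, sample size and planted outlier values are all assumptions for the demo), flagging values more than 3 standard deviations from the mean:

```python
import numpy as np

# Synthetic univariate data with a few planted extreme values (illustrative only).
rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=5.0, size=1000)
data[:3] = [95.0, 5.0, 110.0]  # planted outliers

# Z-score: distance from the mean in units of standard deviation.
z_scores = np.abs((data - data.mean()) / data.std())

# Flag values more than 3 standard deviations from the mean.
outlier_mask = z_scores > 3.0
filtered = data[~outlier_mask]

print("outlier candidates:", data[outlier_mask])
```

With the candidates removed, you would retrain the model on `filtered` and compare performance against a model trained on the full dataset.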
Once you have explored the simpler extreme value methods, consider moving onto proximity-based methods.
- Use clustering methods to identify the natural clusters in the data (such as the k-means algorithm);
- Identify and mark the cluster centroids;
- Identify data instances that are a fixed distance or percentage distance from cluster centroids;
- Filter out outlier candidates from the training dataset and assess your model's performance.
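A minimal sketch of this proximity-based workflow follows. The data, the choice of k=2, the seeding of centroids and the "farthest 5 percent" cut-off are all assumptions made for the example; in practice you would tune these to your dataset:

```python
import numpy as np

# Synthetic 2-D data: two well-separated clusters plus two planted outliers.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 1.0, size=(100, 2))
cluster_b = rng.normal(8.0, 1.0, size=(100, 2))
planted = np.array([[4.0, 20.0], [-10.0, -10.0]])  # planted outliers
data = np.vstack([cluster_a, cluster_b, planted])

# Bare-bones k-means with a fixed iteration budget (k=2 is assumed).
centroids = data[[0, 100]].copy()  # one seed point from each cluster (assumed init)
for _ in range(10):
    distances = np.linalg.norm(data[:, None] - centroids[None], axis=2)
    labels = distances.argmin(axis=1)
    centroids = np.array([data[labels == k].mean(axis=0) for k in range(2)])

# Distance of each point to its nearest final centroid.
dist_to_centroid = np.linalg.norm(data[:, None] - centroids[None], axis=2).min(axis=1)

# Mark the farthest 5% of points as outlier candidates.
threshold = np.quantile(dist_to_centroid, 0.95)
outlier_mask = dist_to_centroid > threshold
print("candidate indices:", np.flatnonzero(outlier_mask))
```

A percentage-based cut-off like this always flags something, so inspect the candidates before discarding them; a fixed distance threshold is the alternative mentioned in the steps above.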
Projection methods are relatively simple to apply and quickly highlight extraneous values.
- Use projection methods to summarize your data in two dimensions (such as PCA or self-organizing maps);
- Visualize the mapping and identify outliers by hand;
- Use proximity measures from projected values to identify outliers;
- Filter out outlier candidates from the training dataset and assess your model's performance.
An alternative strategy is to move to models that are robust to outliers. There are robust forms of regression that minimize the median of the squared errors rather than the mean, although they are computationally intensive. There are also methods, such as decision trees, that are naturally robust to outliers.
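To illustrate the median-versus-mean idea without the cost of least median of squares itself, the sketch below uses the Theil-Sen estimator (a different but related median-based technique: the median of pairwise slopes) and compares it with ordinary least squares on synthetic data containing one corrupted point:

```python
import numpy as np
from itertools import combinations

# Synthetic data: y = 2x, with the last point corrupted (illustrative only).
x = np.arange(10, dtype=float)
y = 2.0 * x
y[9] = 100.0  # outlier: true value would be 18

# Ordinary least squares slope: minimizes the mean squared error,
# so the single outlier drags it well away from the true slope of 2.
ols_slope = np.polyfit(x, y, 1)[0]

# Theil-Sen slope: median of slopes over all point pairs.
# The median ignores the minority of pairs involving the outlier.
slopes = [(y[j] - y[i]) / (x[j] - x[i]) for i, j in combinations(range(10), 2)]
ts_slope = np.median(slopes)

print(f"OLS slope: {ols_slope:.2f}, Theil-Sen slope: {ts_slope:.2f}")
```

Only 9 of the 45 pairwise slopes involve the corrupted point, so the median lands exactly on the uncorrupted slope, while the mean-based fit does not.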