Introduction and Performance Comparison of Various Outlier Detection Models
Anomaly or outlier detection is the process of identifying data points, observations, or events that deviate from the normal behaviour or distribution of a dataset. Anomalous data can indicate critical incidents such as fraudulent transactions, network intrusions, or technical failures. In contrast to standard classification or prediction tasks, anomaly detection is often applied to unlabelled datasets, taking only the internal structure and correlations of the data into account.
Numerous machine learning models are suitable for outlier detection. However, supervised models are more constraining than unsupervised models as they need to be provided with labelled datasets. This requirement is particularly expensive when the labelling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to outlier detection, can also affect the efficiency of supervised models. This article focuses on unsupervised machine learning models to isolate outliers from nominal samples.
2. Unsupervised Outlier Detection Models
This section briefly introduces several well-known and effective unsupervised outlier detection models.
2.1 Isolation Forest (IF) model uses an ensemble of randomly built trees to compute an isolation score for each data point. The model is built by performing recursive random splits on attribute values, hence generating trees able to isolate any data point from the rest of the data. The score of a point is then derived from its average path length, taken over all trees, from the root of a tree to the node containing only that point. A short path denotes a point that is easy to isolate because its attribute values differ significantly from nominal values.
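The splitting and path-length idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the tree count, the subsample size and the toy data are all assumptions, and the normalisation of the average path length into a score between 0 and 1 follows the original Isolation Forest paper rather than anything stated above. It assumes at least two data points.

```python
import math
import random


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random attribute at a random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop when the chosen attribute is constant
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def c(n):
    """Expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def path_length(node, x, depth=0):
    """Depth at which x lands, adjusted for leaves still holding several points."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(child, x, depth + 1)


def isolation_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1); higher means easier to isolate, i.e. more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for x in data:
        mean_depth = sum(path_length(t, x) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / c(sample_size)))
    return scores


# Toy data (hypothetical): a tight cluster plus one distant point.
rng = random.Random(1)
cluster = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(59)]
data = cluster + [(10.0, 10.0)]
scores = isolation_scores(data)
```

On this toy data the distant point is separated from the cluster after very few splits, so it receives the highest score.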
2.2 Kullback-Leibler Divergence (KLD) model uses the KL divergence as an information-theoretic measure for outlier detection. The model first fits a Gaussian mixture model on a training dataset, then estimates the information content of a new data point by measuring the KL divergence between the density estimated on the training dataset and the density estimated on the training dataset augmented with the new data point. This reduces to an F-test in the case of a single Gaussian.
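To make the idea concrete, here is a deliberately simplified sketch that replaces the Gaussian mixture with a single univariate Gaussian, for which the KL divergence has a closed form. The function names, the direction of the divergence (training density first) and the sample values are assumptions made for illustration only.

```python
import math
import statistics


def gaussian_kl(mu0, var0, mu1, var1):
    """KL(N(mu0, var0) || N(mu1, var1)) for univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)


def kld_score(train, x):
    """How much the fitted Gaussian changes when x is added to the training data."""
    mu0, var0 = statistics.fmean(train), statistics.pvariance(train)
    augmented = train + [x]
    mu1, var1 = statistics.fmean(augmented), statistics.pvariance(augmented)
    return gaussian_kl(mu0, var0, mu1, var1)


# Hypothetical one-dimensional training sample.
train = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
typical_score = kld_score(train, 5.05)
unusual_score = kld_score(train, 9.0)
```

A point close to the training data barely changes the fitted density and gets a score near zero, while a distant point shifts both the mean and the variance and gets a larger score.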
2.3 Angle-Based Outlier Detection (ABOD) model uses the radius and variance of angles measured at each input vector instead of distances to identify outliers. The motivation is to remain efficient in high-dimensional spaces and to be less sensitive to the curse of dimensionality. Given an input point x, ABOD samples several pairs of points and computes the corresponding angles at x and their variance. Broad angles imply that x is located inside a major cluster, as it is surrounded by many data points, while small angles denote that x is positioned far from most points in the dataset. Similarly, a higher variance will be observed for points inside or at the border of a cluster than for outliers.
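The angle-variance intuition can be illustrated with a small sketch. It is a simplification: it uses the plain, unweighted variance of the angles over all pairs of other points, whereas the description above samples pairs and the published ABOD factor additionally weights by distance. The function names and the grid data are hypothetical, and the points are assumed to be distinct.

```python
import itertools
import math
import statistics


def angle_at(x, a, b):
    """Angle in radians at vertex x between the vectors x->a and x->b."""
    va = [ai - xi for ai, xi in zip(a, x)]
    vb = [bi - xi for bi, xi in zip(b, x)]
    dot = sum(p * q for p, q in zip(va, vb))
    norm = math.hypot(*va) * math.hypot(*vb)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def angle_variance(x, others):
    """Variance of the angles at x over all pairs of other points."""
    angles = [angle_at(x, a, b) for a, b in itertools.combinations(others, 2)]
    return statistics.pvariance(angles)


# Hypothetical data: a 5x5 grid of points plus one distant point.
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
outlier = (20.0, 20.0)
center = (2.0, 2.0)
inside = angle_variance(center, [p for p in grid if p != center] + [outlier])
outside = angle_variance(outlier, grid)
```

Seen from the distant point, the whole grid falls inside a narrow cone, so the angles are small and nearly identical. The point in the middle of the grid sees the other points in every direction, which yields a much larger variance.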
2.4 One-Class SVM (OCSVM) model is an application of the Support Vector Machine (SVM) algorithm to one-class problems. The model computes a separating hyperplane in a high-dimensional space induced by a kernel, which evaluates dot products between input points in that high-dimensional space. The boundary is fitted to the input data by maximizing the margin between the data and the origin in the high-dimensional space. The model prevents overfitting by allowing a percentage of data points to fall outside the boundary. This percentage acts as a regularization parameter: it is a lower bound on the fraction of support vectors delimiting the boundary and an upper bound on the fraction of margin errors, i.e. training points remaining outside the boundary.
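For reference, the optimisation problem usually associated with this description is the ν-formulation of the one-class SVM by Schölkopf et al. The notation below is an assumption, since the text above does not fix any symbols: n training points x_i, feature map Φ induced by the kernel, slack variables ξ_i, offset ρ, and ν in (0, 1] as the allowed fraction of points outside the boundary.

```latex
\min_{w,\,\xi,\,\rho}\quad \frac{1}{2}\lVert w\rVert^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad\text{subject to}\qquad
\langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr)
```

A point x is flagged as an outlier when f(x) is negative, that is, when it falls on the origin's side of the hyperplane.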
2.5 Local Outlier Factor (LOF) model detects outliers by measuring the local deviation of a given data point with respect to its neighbours. LOF shares some concepts with DBSCAN and OPTICS, such as “core distance” and “reachability distance”, which are used for local density estimation. The local outlier factor is based on the concept of local density, where locality is given by the k nearest neighbours, whose distances are used to estimate the density. By comparing the local density of a point to the local densities of its neighbours, one can identify regions of similar density as well as points that have a substantially lower density than their neighbours. The latter are considered to be outliers.
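The density comparison can be sketched with a brute-force implementation. This is an illustrative simplification: neighbours are found by exhaustive search, ties at the k-th distance are ignored, points are assumed to be distinct, and the function names, the value of k and the grid data are hypothetical.

```python
import math


def knn(data, i, k):
    """Return (distance, index) pairs for the k nearest neighbours of data[i]."""
    dists = sorted((math.dist(data[i], data[j]), j)
                   for j in range(len(data)) if j != i)
    return dists[:k]


def lof_scores(data, k=3):
    """Local outlier factor of every point; values well above 1 suggest outliers."""
    neighbours = [knn(data, i, k) for i in range(len(data))]
    k_distance = [nbrs[-1][0] for nbrs in neighbours]

    def local_reachability_density(i):
        # Reachability distance of i from neighbour j: max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(len(data))]
    return [sum(density[j] for _, j in neighbours[i]) / (k * density[i])
            for i in range(len(data))]


# Hypothetical data: a 4x4 grid of points plus one distant point.
points = [(float(i), float(j)) for i in range(4) for j in range(4)] + [(10.0, 10.0)]
lof = lof_scores(points)
```

Grid points have neighbours with a similar density, so their factor stays close to 1. The distant point has a much lower density than its nearest neighbours and therefore a much larger factor.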
3. Performance Evaluation
Several experiments are performed to assess and compare the performance of the five models introduced above. The experiments use 10 publicly available labelled datasets that are recommended for outlier detection. For each dataset, 75% of the data points are used for training and the remaining 25% are held out as unseen data for testing. The five models are applied to the 10 datasets with their default parameters. Average metrics are calculated for each model and given below.
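A 75/25 split of this kind can be sketched as follows. The function name, the fixed seed and the use of a plain random shuffle are assumptions. In particular, the text does not say whether the split was stratified by class.

```python
import random


def train_test_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices, then hold out a fraction of the points for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical dataset: 100 one-dimensional samples, 5 of them labelled as outliers.
samples = [[float(i)] for i in range(100)]
labels = [1 if i % 20 == 0 else 0 for i in range(100)]
x_train, x_test, y_train, y_test = train_test_split(samples, labels)
```

With heavily imbalanced data, a plain random split can leave very few outliers in the test set, so a stratified split may be preferable in practice.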
3.1 Area under ROC Curve and Precision-Recall (PR) Curve
The area under the ROC curve and the area under the Precision-Recall (PR) curve are used as metrics to evaluate and compare the outlier detection performance of the models on the testing dataset.
The class distribution of the input datasets is heavily imbalanced, with the outlier class as the positive class. For this kind of problem, where the positive class is of more interest than the negative class, precision-recall curves prove particularly useful. Indeed, precision strongly penalizes methods that raise false positives, even if those represent only a small proportion of the negative samples.
It can be seen that the Isolation Forest and One-Class SVM models show excellent average areas under both curves and achieve the best outlier detection results, which makes these two models a reliable choice for outlier detection.
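The difference between the two metrics under class imbalance can be shown with a small, self-contained sketch. The ROC area is computed from its rank interpretation, and the PR area is approximated by average precision, which is one common estimator and an assumption here. The labels and scores are made up for illustration, and at least one sample of each class is assumed.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical test set: 2 outliers among 100 points, with 2 nominal points
# ranked at or above the outliers.
labels = [0, 1, 0, 1] + [0] * 96
scores = [0.95, 0.90, 0.85, 0.80] + [i / 1000 for i in range(96)]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Here only two of the 98 nominal points are ranked above an outlier, so the ROC area remains about 0.985, while the average precision drops to 0.5.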
3.2 Model Complexity and Training Time
The average training time of the five models is evaluated for increasing dataset sizes.
The number of samples has a strong impact on the training time of the KLD, LOF, and OCSVM models, which scale poorly. The IF and ABOD models scale very well with an increasing number of samples, and also have a very small base computation time.
This article demonstrates that the IF model is an excellent method to efficiently identify outliers, showing excellent scalability on large datasets along with an acceptable training time. The OCSVM model is also a good candidate, but it is not suitable for large datasets.
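A timing experiment of this kind can be organised with a small harness such as the one below. Everything in it is a placeholder: the harness name, the synthetic Gaussian data, the dataset sizes, and the quadratic `pairwise_fit` function, which merely stands in for the training routine of a real model.

```python
import math
import random
import time


def time_fit(fit, sizes, n_features=2, seed=0):
    """Return (n_samples, seconds) pairs for calling fit on synthetic data."""
    rng = random.Random(seed)
    timings = []
    for n in sizes:
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
        start = time.perf_counter()
        fit(data)
        timings.append((n, time.perf_counter() - start))
    return timings


def pairwise_fit(data):
    """Placeholder 'training' step with quadratic cost: all pairwise distances."""
    return [[math.dist(a, b) for b in data] for a in data]


timings = time_fit(pairwise_fit, sizes=[50, 100, 200])
```

Plotting the measured times against the number of samples, ideally averaged over several runs, gives the kind of scaling curve discussed above.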