In recent years, anomaly detection has gained great importance and become an active area in both academia and industry. Anomaly detection techniques are commonly used in
- Safety and security applications such as fraud detection
- Health condition monitoring
- Network security
- Data leakage prevention
- Surveillance systems.
In generic terms, an outlier or anomaly is any sample that does not follow the normal pattern of the rest of the data points and is therefore easy to separate from them. However, you should define what normal means for your specific context rather than too generically; otherwise you risk flagging every data point in your dataset as an anomaly.
Types of anomalies
Anomalies can be point, collective, or contextual. Concise definitions follow.
Point anomalies – single instances that are anomalous with respect to the rest of the dataset.
Collective anomalies – collections of data points that exhibit abnormal behaviour together, even if each point looks normal on its own.
Contextual anomalies – instances that are abnormal only in a specific context.
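To make the difference between a point anomaly and a contextual anomaly concrete, here is a minimal Python sketch. The temperature readings, the season labels, and the z-score threshold of 3 are made-up assumptions for illustration, not values from a real dataset.

```python
from statistics import mean, stdev

# Hypothetical historical temperature readings (degrees Celsius), grouped by context.
history = {
    "summer": [28, 30, 29, 31, 30, 29],
    "winter": [2, 4, 3, 5, 3, 4],
}
THRESHOLD = 3.0  # assumed z-score cut-off

def z_score(value, reference):
    """Distance of value from the reference mean, in standard deviations."""
    return abs(value - mean(reference)) / stdev(reference)

def is_point_anomaly(value):
    # Compare against all readings, ignoring the context.
    all_values = [v for values in history.values() for v in values]
    return z_score(value, all_values) > THRESHOLD

def is_contextual_anomaly(value, season):
    # Compare only against readings from the same context.
    return z_score(value, history[season]) > THRESHOLD

print(is_point_anomaly(80))                 # True: extreme in any season
print(is_point_anomaly(29))                 # False: unremarkable overall
print(is_contextual_anomaly(29, "winter"))  # True: unusual for winter
```

A reading of 29 looks perfectly normal when all data is pooled, and only becomes suspicious once the context (winter) is taken into account. Collective anomalies would need a check over groups or sequences of points and are not covered by this sketch.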
Methods of Anomaly Detection
Proposed methods for anomaly detection mainly consist of two parts: feature extraction and model development. Extracting good, robust features from raw data is a crucial step in solving any machine learning problem.
Let’s take the example of using anomaly detection to detect bots on an e-commerce website. The problem is essentially anomaly detection in user behavior. Hence, we need to analyze and find relationships between features in the data, which may be linear, non-linear, or very complex.
Depending on the dataset at hand, you may want to use features such as:
- Non-linear Click Behaviour of the User
- Number of accounts being created in a short timespan (from the same or similar IP addresses)
- Number of requests for certain item (from the same or similar IP addresses)
- Sessions with an unusually large number of views
- Geolocation of the user (whether it is in the operating region of your enterprise)
- Purchase history
After analyzing the data with Exploratory Data Analysis (EDA), you can come up with new features and utilize them. As a rule of thumb, choose features that take unusually high or unusually low values for anomalous samples.
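As an illustration of one feature from the list above, the following sketch computes the largest number of account creations per IP address within a sliding time window. The function name `max_signups_per_window`, the event format `(ip, timestamp_in_seconds)`, and the 10-minute window are assumptions made for this example.

```python
from collections import defaultdict

def max_signups_per_window(events, window_seconds=600):
    """Per IP, return the largest number of sign-ups inside any sliding window."""
    by_ip = defaultdict(list)
    for ip, timestamp in events:
        by_ip[ip].append(timestamp)

    result = {}
    for ip, stamps in by_ip.items():
        stamps.sort()
        best, start = 0, 0
        for end, timestamp in enumerate(stamps):
            # Shrink the window from the left until it spans at most window_seconds.
            while timestamp - stamps[start] > window_seconds:
                start += 1
            best = max(best, end - start + 1)
        result[ip] = best
    return result

# Hypothetical sign-up events: (IP address, seconds since some reference time).
events = [
    ("10.0.0.1", 0), ("10.0.0.1", 30), ("10.0.0.1", 60), ("10.0.0.1", 5000),
    ("10.0.0.2", 100), ("10.0.0.2", 90000),
]
features = max_signups_per_window(events)
print(features)  # {'10.0.0.1': 3, '10.0.0.2': 1}
```

An IP address with a high count in a short window takes an unusually high value on this feature, which is the kind of signal the rule of thumb above asks for.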
Types of algorithms
There is a plethora of algorithms, both supervised and unsupervised, for solving this problem, e.g. random forests, Support Vector Machines (SVMs), K-nearest neighbours (KNNs), and Convolutional Neural Networks (CNNs).
In general, there is limited or no labeled data for anomalies. Therefore, unsupervised algorithms have gained …
Unsupervised learning using autoencoders
We will briefly describe the case of unsupervised learning using autoencoders for anomaly detection. This approach helps you learn the inherent characteristics of the data in order to separate the normal data points from the anomalous ones. It is a more cost-effective way to find anomalies, since it does not require annotated data for training. But, in line with the no-free-lunch theorem, it has its disadvantages too, and you should bear these points in mind. It is often challenging to learn commonalities within data in a complex, high-dimensional space. When using autoencoders, the degree of compression, i.e. the dimensionality reduction, is a hyper-parameter that often requires tuning for optimal results. Fortunately, AutoML tools such as the one recently open-sourced by Google can ease this task.
In addition, unsupervised techniques are very sensitive to noise and data corruption, so make sure to preprocess your dataset thoroughly. After you are done with feature engineering and feature selection, split your data into folds and build your model using autoencoders.
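Splitting the data into folds can be sketched in a few lines of plain Python. The helper name `k_fold_indices`, the choice of five folds, and the fixed seed are assumptions for this example; in practice a library routine would typically be used instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each sample lands in exactly one validation fold, so every data point is used for validation once and for training in the remaining rounds.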
Use of Autoencoders
Using autoencoders to detect anomalies is very intuitive. Autoencoders are a special type of neural network that compresses the input into a lower-dimensional representation and then reconstructs the output from this representation. So when an abnormal data point is fed to the model, its reconstruction error or its position in the latent space should stand out due to the unusual characteristics of the data. Examples of autoencoders are Long Short-Term Memory (LSTM)-based Variational Autoencoders (VAEs) and multi-channel CNN- or LSTM-based encoder-decoders.
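The reconstruction-error idea can be shown with a deliberately tiny example. The sketch below is not an LSTM or CNN model; it is a linear autoencoder with tied weights and a one-dimensional bottleneck, written in plain Python so that it runs without a deep learning framework. The synthetic data, the starting weights, the learning rate, and the threshold rule (mean plus three standard deviations of the training errors) are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def encode(w, x):
    """Compress x to a single latent value z = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def reconstruction_error(w, x):
    """Squared error between x and its reconstruction z * w."""
    z = encode(w, x)
    return sum((xi - z * wi) ** 2 for xi, wi in zip(x, w))

def train(data, epochs=500, lr=0.05):
    """Full-batch gradient descent on the mean reconstruction error."""
    w = [0.1, -0.3]  # arbitrary non-zero starting weights (assumption)
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            r = [xi - z * wi for xi, wi in zip(x, w)]  # residual x - x_hat
            wr = sum(wi * ri for wi, ri in zip(w, r))
            for i in range(2):
                grad[i] -= 2.0 * (x[i] * wr + z * r[i])
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

# Synthetic "normal" behaviour: two strongly correlated features plus small noise.
rng = random.Random(0)
normal = [(i / 100 - 1, i / 100 - 1 + rng.gauss(0, 0.05)) for i in range(201)]

w = train(normal)
errors = [reconstruction_error(w, x) for x in normal]
threshold = mean(errors) + 3 * stdev(errors)  # assumed cut-off rule

def is_anomaly(x):
    return reconstruction_error(w, x) > threshold

print(is_anomaly((0.5, 0.5)))   # follows the learned pattern
print(is_anomaly((1.0, -1.0)))  # breaks the correlation between the features
```

Because the model only learns to reconstruct points where both features move together, a point such as (1.0, -1.0) cannot be squeezed through the bottleneck and comes back with a large error. Real applications would replace this toy model with a non-linear autoencoder and tune the bottleneck size and threshold on validation data.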
Metrics
Choosing the right metric to evaluate your models is of utmost importance in any machine learning project. Hence, you must pay attention not only to accuracy but also to metrics such as true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), precision, recall, ROC AUC, lift, and F1-score.
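The following sketch shows why accuracy alone can mislead when anomalies are rare. The labels are invented for the example (1 marks an anomaly, 0 a normal sample), and the helper names `confusion_counts` and `scores` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 as the anomaly class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def scores(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 100 samples, 5 of them anomalous.
y_true = [1] * 5 + [0] * 95

# A useless model that never flags anything still reaches 95% accuracy.
always_normal = scores(y_true, [0] * 100)

# A detector that catches 4 of 5 anomalies and raises 3 false alarms.
detector = scores(y_true, [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 92)

print(always_normal)
print(detector)
```

The first model has high accuracy but a recall of zero, while the second has only slightly higher accuracy and clearly better precision, recall, and F1-score.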
These were quick tips for using unsupervised learning for anomaly detection. There are many fields in which anomaly detection can be applied. Anomalies in your projects can be cumulative and eventually lead to big financial losses over time. Reach out to Datamics; we will be happy to help you.