Predicting a Failure in Scania’s Air Pressure System (APS)

First off, let’s import the required packages and read our training data.

The dataset consists of 171 features, including the class label. Also, in the class label attribute, we will replace ‘neg’ with 0 and ‘pos’ with 1.
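As a rough illustration of this step, the sketch below reads a CSV and encodes the class label. The helper name `load_aps` and the tiny inline sample are hypothetical, and treating the string "na" as the missing-value marker is an assumption; the real file may also need extra arguments (for example to skip a text header).

```python
import io

import pandas as pd


def load_aps(source):
    """Read an APS-style CSV and encode the class label as 0/1."""
    # Assumption: missing values are written as the string "na".
    df = pd.read_csv(source, na_values="na")
    # Replace 'neg' with 0 and 'pos' with 1 in the class label.
    df["class"] = df["class"].map({"neg": 0, "pos": 1})
    return df


# Tiny inline stand-in for the real training file (made-up values).
sample_csv = io.StringIO(
    "class,aa_000,ab_000\n"
    "neg,76698,na\n"
    "pos,33058,0\n"
)
df = load_aps(sample_csv)
print(df["class"].tolist())  # [0, 1]
```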

The class distribution graph shows a serious class imbalance: of the 60,000 training points, about 59,000 belong to the negative class and only about 1,000 belong to the positive class. We can either up-sample the minority class or use a modified classifier to tackle this problem. The percentage of missing data is also very high in some features (as high as 82% in one feature).

Features that have the same value for every data point carry no information that could improve the performance of our model, so we can discard them. In practice, this means removing the features whose standard deviation is 0.

One feature, ‘cd_000’, has a constant value for all data points, so we remove it.
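A minimal sketch of the zero-standard-deviation check is shown below. The helper name `drop_constant_features` and the toy values are hypothetical; only the column name ‘cd_000’ comes from the dataset.

```python
import pandas as pd


def drop_constant_features(df):
    """Drop numeric columns whose standard deviation is 0."""
    stds = df.std(numeric_only=True)
    constant_cols = stds[stds == 0].index.tolist()
    return df.drop(columns=constant_cols), constant_cols


toy = pd.DataFrame({
    "aa_000": [1.0, 2.0, 3.0],
    "cd_000": [5.0, 5.0, 5.0],  # same value in every row
})
reduced, dropped = drop_constant_features(toy)
print(dropped)  # ['cd_000']
```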

It is always a good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

We can perform some basic handling of missing data in the following manner:

We will discard features with more than 70% missing values.
For features with less than 5% missing values, we will drop the rows that contain those missing values.
For features with 5–15% missing values, we will impute the missing values using the mean/median.
For the remaining features, with 15–70% missing values, we will use a model-based imputation technique.
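The rules above can be sketched as a small helper that sorts features into buckets by their percentage of missing values. The function name, the bucket names and the handling of values that fall exactly on a boundary (5%, 15%, 70%) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def bucket_by_missing(df, drop_above=70, rows_below=5, median_upto=15):
    """Group column names by their percentage of missing values."""
    pct = df.isna().mean() * 100
    return {
        "drop_feature": pct[pct > drop_above].index.tolist(),
        "drop_rows": pct[pct < rows_below].index.tolist(),
        "median": pct[(pct >= rows_below) & (pct <= median_upto)].index.tolist(),
        "model": pct[(pct > median_upto) & (pct <= drop_above)].index.tolist(),
    }


n = 20
toy = pd.DataFrame({
    "a": np.arange(n, dtype=float),   # 0% missing
    "b": [np.nan] * 2 + [1.0] * 18,   # 10% missing
    "c": [np.nan] * 8 + [1.0] * 12,   # 40% missing
    "d": [np.nan] * 16 + [1.0] * 4,   # 80% missing
})
plan = bucket_by_missing(toy)
print(plan)
```

Rows would then be dropped with something like `df.dropna(subset=plan["drop_rows"])`, and the columns in `plan["drop_feature"]` removed entirely.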

128 features have less than 5% of their values missing, so we drop the rows that contain missing values for these features (4,027 rows). 7 features (‘br_000’, ‘bq_000’, ‘bp_000’, ‘bo_000’, ‘ab_000’, ‘cr_000’, ‘bn_000’) have more than 70% of their values missing. These features have been removed.

The class label has then been separated from our dataset, leaving us with a dataset of shape (55973,162).

14 features have 5% to 15% of their values missing. These are passed through sklearn’s SimpleImputer, and the missing values are imputed using the ‘median’ strategy. Then, for features with 15% to 70% missing values, we perform an iterative model-based imputation technique called MICE. At each step, a feature with missing values is designated as the output y and the other feature columns are treated as inputs X. A regressor (we have used a Ridge regressor) is fit on (X, y) for the known values of y. The regressor is then used to predict the missing values of y. This is done for each feature in turn, and the whole process is repeated for max_iter (10 by default) imputation rounds. The results of the final imputation round are returned.
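The two imputation stages might look roughly like the sketch below, assuming sklearn’s `SimpleImputer` and `IterativeImputer` with a `Ridge` estimator. The synthetic data and the column names `f1` to `f4` are made up for illustration; in the real dataset the column lists would come from the missing-value percentages.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
X["f4"] = 2 * X["f1"] + rng.normal(scale=0.1, size=200)   # f4 depends on f1
X.loc[rng.choice(200, 20, replace=False), "f2"] = np.nan  # 10% missing
X.loc[rng.choice(200, 60, replace=False), "f4"] = np.nan  # 30% missing

# Stage 1: median imputation for the moderately sparse columns.
median_cols = ["f2"]
median_imputer = SimpleImputer(strategy="median")
X[median_cols] = median_imputer.fit_transform(X[median_cols])

# Stage 2: iterative (MICE-style) imputation with a Ridge regressor.
mice_imputer = IterativeImputer(estimator=Ridge(), max_iter=10, random_state=0)
X_imputed = pd.DataFrame(mice_imputer.fit_transform(X), columns=X.columns)

print(X_imputed.isna().sum().sum())  # 0
```

Keeping the fitted `median_imputer` and `mice_imputer` objects around is what allows the same transformation to be applied to the test set with `transform` later on.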

All of the fitted imputers above are saved, and the same preprocessing steps are then applied to the test dataset.

We are told that certain features hold histogram bin information: the prefix (the letters before the ‘_’) is the identifier and the suffix is the bin id (Identifier_Bin).

To find the features that contain histogram bin information, we use the fact that all features from a single histogram share the same prefix.

We can see that there are 7 sets of features with 10 bins each. In other words, there are 7 histograms divided into 10 bins each. For example, identifier ‘ag’ consists of ag_000, ag_001, ag_002, ag_003, ag_004, ag_005, ag_006, ag_007, ag_008 and ag_009.

The Histogram Identifiers are: [‘ag’, ‘ay’, ‘az’, ‘ba’, ‘cn’, ‘cs’, ‘ee’].

There are 70 features that contain histogram bin information and they are: 
['ag_000', 'ag_001', 'ag_002', 'ag_003', 'ag_004', 'ag_005', 'ag_006', 'ag_007', 'ag_008', 'ag_009', 'ay_000', 'ay_001', 'ay_002', 'ay_003', 'ay_004', 'ay_005', 'ay_006', 'ay_007', 'ay_008', 'ay_009', 'az_000', 'az_001', 'az_002', 'az_003', 'az_004', 'az_005', 'az_006', 'az_007', 'az_008', 'az_009', 'ba_000', 'ba_001', 'ba_002', 'ba_003', 'ba_004', 'ba_005', 'ba_006', 'ba_007', 'ba_008', 'ba_009', 'cn_000', 'cn_001', 'cn_002', 'cn_003', 'cn_004', 'cn_005', 'cn_006', 'cn_007', 'cn_008', 'cn_009', 'cs_000', 'cs_001', 'cs_002', 'cs_003', 'cs_004', 'cs_005', 'cs_006', 'cs_007', 'cs_008', 'cs_009', 'ee_000', 'ee_001', 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008', 'ee_009']
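One way to recover these groups is to bucket column names by prefix and keep the prefixes that have exactly 10 members. This is a sketch under that assumption; the helper name `find_histograms` is hypothetical and the column list is a shortened stand-in for the real one.

```python
from collections import defaultdict


def find_histograms(columns, n_bins=10):
    """Map each prefix with exactly n_bins columns to its sorted bin columns."""
    groups = defaultdict(list)
    for col in columns:
        prefix, _, _ = col.partition("_")  # text before the first underscore
        groups[prefix].append(col)
    return {p: sorted(cols) for p, cols in groups.items() if len(cols) == n_bins}


columns = [f"ag_{i:03d}" for i in range(10)] + ["aa_000", "ab_000", "am_0"]
histograms = find_histograms(columns)
print(list(histograms))  # ['ag']
```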

We will select the top features from both datasets (histogram and numerical features) using the fully imputed set. The analysis itself, however, will be performed on the data that still has missing values.

We will perform EDA on the top 15 features of the histogram dataset. To select the features, we will perform Recursive Feature Elimination using a Random Forest Classifier.

The top 15 features are:

['ag_001', 'ag_002', 'ag_003', 'ay_005', 'ay_006', 'ay_008', 'ba_002', 'ba_003', 'ba_004', 'cn_000', 'cn_004', 'cs_002', 'cs_004', 'ee_003', 'ee_005']
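A selection of this kind can be sketched with sklearn’s `RFE` wrapped around a `RandomForestClassifier`. The synthetic data, the feature names `f0` … `f29` and the hyperparameters below are illustrative assumptions, not the settings used for the lists in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the imputed feature matrix.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=8, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Repeatedly fit the forest and drop the least important feature
# until only 15 remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=15,
    step=1,
)
selector.fit(X, y)

top_features = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(len(top_features))  # 15
```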

We plot the PDF, CDF and box plots of each of these features to understand the distribution of our data. The observations made are as follows:

Plots of features ag_003, ay_008, ba_002, ba_003, ba_004, cn_004, cs_002, cs_004, ee_003 and ee_005 show that lower values of these features indicate no failure in the APS component, while higher values indicate an APS component failure.

Around 99% of the values of features ag_001 and ay_005 are 0 when there is no failure in the APS component.

We can say that in these top features, a higher value may indicate a failure in the truck’s Air Pressure System.

However, there are a few cases where the values are higher than usual but still do not lead to APS failure. Example: feature ee_005.

Looking at how each feature is correlated with the target variable (‘class’), we can observe that feature ‘ay_005’ is the least correlated with the target among our top attributes. We can perform further Bivariate Analysis on how the other top features vary with respect to ‘ay_005’.
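The correlation check can be sketched as follows, assuming a Pearson correlation between each feature and the class label (the post does not state which correlation measure was used). The helper name and toy columns are hypothetical.

```python
import pandas as pd


def least_correlated_feature(features, target):
    """Return the feature with the smallest absolute correlation to the target."""
    corr = features.corrwith(target).abs()
    return corr.idxmin(), corr.sort_values()


target = pd.Series([0, 0, 0, 1, 1, 1], name="class")
features = pd.DataFrame({
    "strong": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],  # rises with the class
    "weak": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],       # same pattern in both classes
})
name, ranking = least_correlated_feature(features, target)
print(name)  # weak
```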

ag_002, ag_001, cn_000: It can be seen from the scatter plot that for any value of the other top features, there is failure in the APS component (class label = 1) when the value in feature ‘ay_005’ is nearly 0.

We will now perform EDA on the top 15 features of the numerical dataset. To select the features, we again perform Recursive Feature Elimination using a Random Forest Classifier.

The top 15 features are:

['aa_000', 'al_000', 'am_0', 'ap_000', 'aq_000', 'bj_000', 'bu_000', 'bv_000', 'ci_000', 'cj_000', 'cq_000', 'dg_000', 'dn_000', 'do_000', 'dx_000']

We plot the PDF, CDF and box plots of each of these features to understand the distribution of our data. The observations made are as follows:

aa_000: If there is no failure in the APS (class label = 0), about 95% of the points have a value below 0.1 × 1e6 (that is, 100,000). A value higher than that usually indicates a failure in the APS component.

al_000, am_0: The values for failure and non-failure instances of the APS component are not clearly separable in these features, although the failure cases do have slightly higher values.

ap_000, aq_000, bj_000, bu_000: Instances of failure have higher values than non-failure cases. However, there are a few instances of non-failure of the APS component that also show higher values in these features.

In all features except dg_000, cj_000, am_0 and al_000, higher values usually indicate failure in the APS component. However, due to the imbalanced nature of the data, this may not be certain.

Looking at how each feature is correlated with the target variable (‘class’), we can observe that feature ‘dx_000’ is the least correlated with the target among our top attributes. We can perform further Bivariate Analysis on how the other top features vary with respect to ‘dx_000’.

The main observation in all plots here is that for any value of the remaining features, if the feature ‘dx_000’ has a low value (nearly 0), it may indicate a failure in the APS component (class label = 1).

The dataset consists of 60,000 data points and 171 features, including the class label.
After plotting the count of each class label, we found that out of 60,000 points, about 59,000 belong to class 0 and the remaining 1,000 belong to class 1. We are working with a highly imbalanced binary classification problem.
We then checked for missing values in our dataset. Some features have more than 70% of their values missing, so we removed them from the dataset. 7 features were removed in this way.
One feature (cd_000) had a single value for all data points. We removed it as well, since it cannot add value to our model’s performance.
For features with less than 5% missing data, the rows containing NA values were removed. Features with 5–15% missing values were imputed using the median, and features with 15–70% missing values were imputed using a model-based imputation technique.
There are 70 features which hold bin information from 7 histograms. Each histogram has 10 bins. The histogram features are the ones with the identifiers [‘ag’, ‘ay’, ‘az’, ‘ba’, ‘cn’, ‘cs’, ‘ee’]. The histogram and numerical features were separated into two datasets, and we performed Univariate and Bivariate Analysis on the top 15 features of both datasets.
By performing Recursive Feature Elimination with a Random Forest Classifier, we found the top 15 features of the histogram dataset to be: [‘ag_001’, ‘ag_002’, ‘ag_003’, ‘ay_005’, ‘ay_006’, ‘ay_008’, ‘ba_002’, ‘ba_003’, ‘ba_004’, ‘cn_000’, ‘cn_004’, ‘cs_002’, ‘cs_004’, ‘ee_003’, ‘ee_005’].
Analysis of these top features shows that a higher value may indicate a failure in the truck’s Air Pressure System. However, there are a few cases where the values are higher than usual but still do not lead to APS failure (for example, feature ee_005). From Bivariate Analysis against the feature least correlated with the target variable (ay_005), we saw that for ag_002, ag_001 and cn_000, whatever the value of these features, there is failure in the APS component (class label = 1) when the value of ‘ay_005’ is nearly 0.
By performing Recursive Feature Elimination with a Random Forest Classifier, we found the top 15 features of the numerical dataset to be: [‘aa_000’, ‘al_000’, ‘am_0’, ‘ap_000’, ‘aq_000’, ‘bj_000’, ‘bu_000’, ‘bv_000’, ‘ci_000’, ‘cj_000’, ‘cq_000’, ‘dg_000’, ‘dn_000’, ‘do_000’, ‘dx_000’].
From Univariate Analysis, we saw that in all features except dg_000, cj_000, am_0 and al_000, higher values usually indicate failure in the APS component. However, due to the imbalanced nature of the data, this may not be certain. Feature ‘dx_000’ was the least correlated with the target among the top features. We performed Bivariate Analysis as we did for the histogram features, and the main observation in all plots is that for any value of the remaining features, if ‘dx_000’ has a low value (nearly 0), it may indicate a failure in the APS component (class label = 1).

Standardizing a vector most often means subtracting a measure of location and dividing by a measure of scale. For example, if the vector contains random values with a Gaussian distribution, you might subtract the mean and divide by the standard deviation, thereby obtaining a “standard normal” random variable with mean 0 and standard deviation 1. Here, we scale our data using sklearn’s MinMaxScaler, which instead rescales each feature to the [0, 1] range based on its minimum and maximum values.
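A minimal sketch of min-max scaling with sklearn’s `MinMaxScaler` is shown below, using made-up numbers. The scaler is fit on the training data only and then reused for the test data, so that both are scaled with the same minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

scaler = MinMaxScaler()                         # default range is [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from train
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled)
```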

A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. One way to solve this problem is to oversample the examples in the minority class.

The combination of SMOTE and under-sampling performs better than plain under-sampling.

After resampling, we have 33,226 points belonging to the negative class and 16,613 points belonging to the positive class. We will pass our scaled dataset through the linear models (Logistic Regression and Support Vector Machines).
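To make the idea concrete, here is a simplified NumPy-only sketch of SMOTE-style oversampling followed by random under-sampling. It is not the implementation from any particular library: the function names, the neighbour count `k` and the toy sizes are assumptions, and the 2:1 negative-to-positive target simply mirrors the ratio of the counts quoted above.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by interpolating to neighbours."""
    rng = np.random.default_rng(rng)
    # Pairwise squared distances between minority points.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def random_undersample(X_maj, n_keep, rng=None):
    """Keep a random subset of the majority class without replacement."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]


rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(1000, 3))  # toy negative class
X_min = rng.normal(3.0, 1.0, size=(20, 3))    # toy positive class

X_syn = smote_like_oversample(X_min, n_new=80, rng=1)
X_min_balanced = np.vstack([X_min, X_syn])                 # 100 positives
X_maj_balanced = random_undersample(X_maj, n_keep=200, rng=2)  # 200 negatives

print(X_maj_balanced.shape, X_min_balanced.shape)
```

Resampling should be applied to the training data only, so that the test set keeps its original class distribution.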

Now that we have performed our EDA, data preprocessing and feature engineering, let’s move on to modelling. We will pass our data through various models, perform hyperparameter tuning, and evaluate each of them based on our performance metric (Macro-F1 Score) and the confusion matrix. The models that we will try out here are Logistic Regression, Support Vector Machines, Naive Bayes, Decision Trees, Random Forest, Gradient Boosted Decision Trees, AdaBoost Classifier and a Custom Ensemble.

As a Baseline Model, we will predict all class labels to be 0 (majority class) and calculate the F1 score for the same. We can use sklearn’s DummyClassifier to obtain our baseline results.
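As a small illustration of such a baseline, the sketch below fits a `DummyClassifier` that always predicts the majority class and scores it with the macro-F1 metric. The toy labels (nine negatives, one positive) are made up; `zero_division=0` is set because the positive class is never predicted.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

y = np.array([0] * 9 + [1])  # toy imbalanced labels
X = np.zeros((len(y), 1))    # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

# F1 for class 0 is about 0.947 and F1 for class 1 is 0,
# so the macro average is about 0.474.
macro_f1 = f1_score(y, y_pred, average="macro", zero_division=0)
print(round(macro_f1, 4))  # 0.4737
```

This shows why macro-F1 is a useful metric here: a model that ignores the minority class entirely scores below 0.5 even though its accuracy would be 90% on this toy data.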

For our Custom Ensemble:

Split the train set into D1 and D2 (50–50).
From D1, perform sampling with replacement to create d1, d2, d3 … dk (k samples).
Now, create ‘k’ models and train each model on one of these k samples.
Pass D2 through each of these ‘k’ models, which gives us ‘k’ predictions for D2, one from each model.
Using these ‘k’ predictions, create a new dataset. Since we already know the corresponding target values for D2, we can now train a meta model with these ‘k’ predictions as features.
For model evaluation, we pass our test set through each of the base models and get ‘k’ predictions. We then create a new dataset with these ‘k’ predictions and pass it to the previously trained meta model to get our final prediction.
Using this final prediction together with the targets for the test set, we can calculate the model’s performance score.

We can use Decision Trees as our base model and GBDT as the metamodel. This is a custom implemented model.
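The procedure can be sketched roughly as follows, with decision trees as base models and a gradient boosting classifier as the meta model. The function names, the stratified 50–50 split, the default hyperparameters and the synthetic data are assumptions for illustration; the actual implementation may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_custom_ensemble(X_train, y_train, k=10, seed=0):
    """Train k bootstrap base models on D1 and a meta model on D2."""
    rng = np.random.default_rng(seed)
    # Split the train set into D1 and D2 (50-50); stratifying is an assumption.
    X_d1, X_d2, y_d1, y_d2 = train_test_split(
        X_train, y_train, test_size=0.5, stratify=y_train, random_state=seed
    )
    base_models = []
    for _ in range(k):
        # Sample D1 with replacement to build one bootstrap sample.
        idx = rng.integers(0, len(X_d1), size=len(X_d1))
        tree = DecisionTreeClassifier(random_state=seed)
        base_models.append(tree.fit(X_d1[idx], y_d1[idx]))
    # The k base predictions on D2 become the meta model's features.
    meta_features = np.column_stack([m.predict(X_d2) for m in base_models])
    meta_model = GradientBoostingClassifier(random_state=seed)
    meta_model.fit(meta_features, y_d2)
    return base_models, meta_model


def predict_custom_ensemble(base_models, meta_model, X):
    """Pass X through the base models, then through the meta model."""
    meta_features = np.column_stack([m.predict(X) for m in base_models])
    return meta_model.predict(meta_features)


X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models, meta_model = fit_custom_ensemble(X_tr, y_tr, k=5)
y_pred = predict_custom_ensemble(base_models, meta_model, X_te)
print(f1_score(y_te, y_pred, average="macro"))
```

Here `k` acts as a hyperparameter of the ensemble and can be tuned like any other.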

After performing hyperparameter tuning and experimenting with various models, we see that Gradient Boosted Decision Trees work best, as they achieve the highest Macro-F1 Score (as seen below).

Summary of Modelling

The model can be deployed on our local server using a Flask API. The code for this includes loading the required models, creating a Pandas DataFrame from the .csv file at the specified path, and storing the final output in a .csv file in the output directory.

The HTML code for the same is given below

HTML CODE

On running the above code, the html page on our local server would look something like this, where you can specify the path to your input file and output directory:

HTML page to specify paths

The output directory will contain a .csv file (with a timestamp in its name) that holds the preprocessed dataset along with the model predictions.

We can conclude that Gradient Boosted Decision Trees perform well when applied to data with a combination of median and MICE imputation. The results are good after performing the necessary feature engineering. I hope this project gives you a fair idea of how to approach any data science project, especially if you’re just starting out 🙂

You can view the entire code on my GitHub, and feel free to contact me through LinkedIn or Twitter.

Deep Learning methods may also be used to solve this problem, and the resulting Neural Network can be evaluated using the same performance metric.
Other imputation methods, such as the Soft-Impute algorithm, can also be tried.

Kaggle: https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
IDA 2016 Industrial Challenge: Using Machine Learning for predicting Failures: https://link.springer.com/chapter/10.1007/978-3-319-46349-0_33
Applied AI Course: https://www.appliedaicourse.com/
Machine Learning Mastery: https://machinelearningmastery.com/
