Regression analysis may be defined as a type of predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. This technique is used for forecasting, time series modelling and finding cause-effect relationships between variables.
Regression analysis may be considered a reliable method of identifying the variables that have an impact on a topic of interest. In the final part of the study, I tried to determine which factors matter most and which factors can be ignored. When I used only the most important parameters (the 4 out of 8 with the highest correlation to the target), the results were still quite satisfactory.
Evaluating accuracy is an essential part of assessing a machine learning model, because it describes how well the model performs in its predictions. Evaluation metrics change according to the problem type. The errors represent how far off the model is in its predictions. The basic concept of accuracy evaluation is to compare the original target with the predicted one according to certain metrics. The metrics below (MAE, RMSE and R-squared) are mainly used to evaluate the prediction error rates and model performance in regression analysis.
A clearer definition (at least for me) is: “The coefficient of determination, R2, is used to analyze how differences in one variable can be explained by a difference in a second variable. More specifically, R-squared gives you the percentage variation in y explained by x-variables.” More often than not (unless there are too many variables), R-squared is used for evaluating the model performance.
A little domain knowledge: more tons of concrete are produced and used than any other technical material. This is due to its low cost and widely available raw materials (water, slag, cement…). Cement is the bonding material in the concrete. Concrete strength (the target value) is affected by many factors, such as the quality of the raw materials, the water/cement ratio, the coarse/fine aggregate ratio and the age of the concrete.
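To make the quoted definition concrete, here is a minimal sketch that computes R-squared directly from its definition, 1 - SS_res / SS_tot. The function name `r_squared` and the strength values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical compressive strengths in MPa (illustration only)
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]
r2 = r_squared(y_true, y_pred)  # 1 - 10 / 500 = 0.98
```

A model that always predicts the mean of the target gets a score of 0, and a perfect model gets 1.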
This article involves the application of 5 different machine learning algorithms, with the aim of comparing their performances for regression analysis. The machine learning algorithms that I used for regression are:
1. Linear Regression,
2. Lasso Regression,
3. Ridge Regression,
4. Random Forest Regression,
5. XGBoost Regression.
I tried to concentrate on the following issues:
- Which regression algorithm provided the best results?
- Is Cross Validation effective in increasing the regression performance?
- How does using only the most important features affect the performance of the algorithm?
I used the Kaggle dataset. As explained on the website, the first seven parameters are the ingredients added to the concrete (in kg per m³ of mixture), and “Age” is given in days (1 to 365). There are 8 quantitative input variables and 1 quantitative output variable (Compressive Strength of the concrete, in MPa); there are 1030 instances with no missing data.
The first step is to import and clean the data (if needed) using pandas before starting the analysis.
According to the dataset, there were 1030 different prepared concretes and no missing values.
I also checked for outliers, which may degrade the performance of the analysis. The boxplots below show that there were not many outliers, so I moved on to the distribution of the values.
Cement, water, coarse aggregate and fine aggregate have the highest (absolute-valued) correlation coefficients, and their amounts in the concrete were close to normally distributed. In many concrete mixtures, slag, fly ash and superplasticizer were not used at all. A reasonable explanation for the distribution pattern of these three parameters may be that two distributions were combined in the data, giving the appearance of a bimodal distribution. Also, aging time was mostly kept short (a right-skewed distribution).
The dataset is clean (there are no NaNs and the dtypes are correct), so I started directly by checking the correlation of the parameters with the target value, csMPa (Compressive Strength of concrete). The figure below shows that cement and superplasticizer have the highest positive correlation coefficients and water has the strongest negative correlation coefficient.
I used the Python function below to evaluate the prediction errors.
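A boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same check can be done numerically. The sketch below assumes that rule; the helper `count_iqr_outliers` and the toy values are hypothetical and only stand in for the real columns.

```python
import pandas as pd

def count_iqr_outliers(df):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
    return outside.sum()

# Toy values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "cement": [150, 200, 250, 300, 350, 400, 450],
    "age": [3, 7, 14, 28, 28, 56, 365],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

In this toy frame only the 365-day age value is flagged, which is the kind of long right tail a boxplot would show as isolated points.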
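As a rough sketch of this step, the correlation of each feature with the target can be read from one column of the pandas correlation matrix. The data below is synthetic and only mimics the signs described here (cement positive, water negative); apart from `csMPa`, the column names and coefficients are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
cement = rng.uniform(100, 500, n)
water = rng.uniform(120, 250, n)
noise = rng.normal(0, 5, n)

# Synthetic stand-in for the real data: strength rises with cement, falls with water
toy = pd.DataFrame({
    "cement": cement,
    "water": water,
    "csMPa": 0.1 * cement - 0.2 * water + 40 + noise,
})

# Pearson correlation of every feature with the target, strongest positive first
corr_with_target = toy.corr()["csMPa"].drop("csMPa").sort_values(ascending=False)
print(corr_with_target)
```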
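The original function is not reproduced in this text. A minimal sketch of what such a helper might look like, assuming scikit-learn's metric functions and a hypothetical name `evaluate_regression`, is:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_regression(y_true, y_pred):
    """Return MAE, RMSE and R2 for a set of predictions (hypothetical helper)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same unit as the target
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up strengths in MPa, for illustration only
scores = evaluate_regression([20.0, 30.0, 40.0, 50.0], [22.0, 28.0, 41.0, 49.0])
print(scores)
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions.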
I started by using linear regression in order to model the relationship between the features and the target variable. MAE and RMSE are expressed in MPa (the unit of compressive strength); they tell us that the predictions deviate from the observed values by 7.74 MPa on average (MAE), with a root mean squared error of 9.79 MPa. An R2-score of 0.6275 is not very high, but it shows that even noisy, high-variability data can have a significant trend. The trend indicates that the predictor variables still provide information about the response even though the data points fall further from the regression line.
Ridge regression is defined as a model tuning method that is used to analyze data that suffers from multicollinearity. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, so the predicted values may be far away from the actual values. The heatmap below shows that the strongest correlation is between water and superplasticizer (-0.66), and this may be the reason for the low r2_score of the linear regression analysis.
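For readers who want to reproduce this kind of baseline, a minimal sketch with scikit-learn follows. It runs on synthetic data, so the features, coefficients and split settings are assumptions and the scores will not match the numbers reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (3 made-up features)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = 30 * X[:, 0] - 15 * X[:, 1] + 5 * X[:, 2] + rng.normal(0, 3, 300)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, R2: {r2:.3f}")
```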
Ridge regression analysis results are given as:
Ridge regression uses α (alpha) as the parameter that balances the emphasis given to minimizing the RSS versus minimizing the sum of squared coefficients. α can take various values. For Ridge Cross Validation, I defined an array of 100 possible alpha values between 0.1 and 20; the optimum value turned out to be 0.1. When I fit the model again using 0.1 as the alpha, there was a huge increase in the r2_score, from 0.45 to 0.60.
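The alpha search described above can be sketched with scikit-learn's `RidgeCV`, using the same grid of 100 values between 0.1 and 20. The data here is synthetic, with two nearly collinear features, so the selected alpha and the score are illustrative only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.5, size=n)

# 100 candidate alpha values between 0.1 and 20
alphas = np.linspace(0.1, 20, 100)

# With the default cv=None, RidgeCV uses an efficient leave-one-out scheme
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = ridge_cv.alpha_
r2_train = ridge_cv.score(X, y)
print(f"best alpha: {best_alpha:.2f}, training R2: {r2_train:.3f}")
```

If the best alpha sits at the lower edge of the grid, as it did in this study, it may be worth extending the grid towards smaller values.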
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. Like Ridge regression, Lasso is well-suited for models showing high levels of multicollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination. As mentioned in the Ridge Regression section, there is multicollinearity in this data and using Lasso Regression may improve the results.
Lasso Cross Validation analysis results for alpha values of 0.1 and 0.0001 are given as:
The jump in the r2_score is very large, from 0.44 to 0.62, when the alpha value becomes almost zero. With alpha close to zero, Lasso behaves almost like ordinary linear regression, so this shows that the Lasso penalty cannot improve the r2_score beyond what plain linear regression achieved (0.6275).
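The effect of alpha can be illustrated with a small sketch: as alpha approaches zero, the Lasso coefficients converge to the ordinary least-squares ones. The data is synthetic; only the two alpha values (0.1 and 0.0001) are taken from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 4 * X[:, 0] - 2 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso_strong = Lasso(alpha=0.1).fit(X, y)   # noticeable shrinkage
lasso_weak = Lasso(alpha=0.0001).fit(X, y)  # almost no shrinkage

# Total coefficient size: the penalty pulls it below the least-squares value
l1_ols = np.sum(np.abs(ols.coef_))
l1_strong = np.sum(np.abs(lasso_strong.coef_))

# Largest coefficient difference between near-zero-alpha Lasso and plain OLS
gap_weak = np.max(np.abs(lasso_weak.coef_ - ols.coef_))
print(f"L1 norm OLS: {l1_ols:.3f}, Lasso(0.1): {l1_strong:.3f}, gap at 0.0001: {gap_weak:.5f}")
```

On the training data, plain least squares always has the highest R2 of the three, which is why shrinking alpha towards zero can only move the Lasso score towards the linear regression score, not past it.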
Random Forest Analysis
The Wikipedia definition is: “Random forests or random decision forests are an ensemble learning method (multiple learning algorithms) for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees.”
Random Forest analysis results display a huge increase in the r2_score:
Cross Validation mean score is almost 0.90:
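A cross-validated score of this kind could be computed along the following lines. This is a sketch on synthetic nonlinear data; the number of trees, the 5 folds and the random seeds are assumptions rather than the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear target with 4 made-up features
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 4))
y = 40 * X[:, 0] * X[:, 1] + 10 * np.sin(6 * X[:, 2]) + rng.normal(0, 1, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=7)

# 5-fold cross validation: each fold is held out once and scored with R2
cv_scores = cross_val_score(forest, X, y, cv=5, scoring="r2")
cv_mean = cv_scores.mean()
print(f"fold scores: {np.round(cv_scores, 3)}, mean: {cv_mean:.3f}")
```

Averaging over folds gives a more stable estimate than a single train/test split, because every sample is used for testing exactly once.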
The next step was to evaluate the importance of the features:
I will consider the first 4 features (cement, age, water and slag) for further analysis. After getting new dataframes including only the important features and splitting the dataset, the results are as follows:
To summarize, I used only 4 features instead of 8 (the ones considered most important), and the r2_score dropped only from 0.8995 to 0.8712. In this case there were only 1030 samples, but with huge datasets, working with fewer features will save a lot of time at the cost of a small loss of accuracy.
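The feature-selection step described above can be sketched as follows: fit a forest, rank the features by `feature_importances_`, and keep the top 4. The data is synthetic and the names `feature_0` to `feature_7` are hypothetical placeholders for the 8 real inputs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
names = [f"feature_{i}" for i in range(8)]  # hypothetical column names
X = pd.DataFrame(rng.uniform(0, 1, size=(400, 8)), columns=names)

# Only the first four features drive the synthetic target
y = (30 * X["feature_0"] + 20 * X["feature_1"] + 15 * X["feature_2"]
     + 10 * X["feature_3"] + rng.normal(0, 1, 400))

forest = RandomForestRegressor(n_estimators=200, random_state=3).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

top4 = list(importances.index[:4])
X_reduced = X[top4]  # dataframe restricted to the 4 most important features
print(importances.round(3))
```

The reduced dataframe can then be split and fitted exactly like the full one to measure how much accuracy is lost.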
XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library and it is used for supervised ML problems. XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners.
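To illustrate how boosting turns weak learners into a strong one, here is a minimal sketch of plain gradient boosting with depth-1 trees (stumps) from scikit-learn. This is not XGBoost itself, which adds regularization and other optimizations on top of the same idea; the learning rate, the number of rounds and the data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(300, 2))
y = 20 * np.sin(4 * X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 1, 300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residuals = y - prediction  # negative gradient of the squared error
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # weak learner
    prediction += learning_rate * stump.predict(X)  # small corrective step
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.2f} -> {mse_history[-1]:.2f}")
```

Each stump alone is a poor model, but because every round fits what the previous rounds got wrong, the training error of the combined model keeps shrinking.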
I used XGBoost and got the best results as shown:
Once again, I used the important features function and selected the 4 most important parameters. The drop in r2_score was very small, from 0.9300 to 0.9234:
In this article, I applied 5 regression algorithms and covered some key facts associated with each technique. The analysis provides evidence that:
- XGBoost algorithm provided the best results, followed by the Random Forest algorithm.
- Linear, Lasso and Ridge regression algorithms’ results were quite close.
- After cross validation, there was a large increase in the accuracy of the Lasso and Ridge regressions (r2_score from 0.44 to 0.62 for Lasso and from 0.45 to 0.60 for Ridge).
- Using only the 4 most important features (out of 8) resulted in a small decrease in the accuracy of the results, with high potential for saving time on large datasets.
You can access this article and similar ones here.