Regression modelling with LinearSVR
Specifically, we will observe how to use the LinearSVR to fit a hyperplane across the observations in question, as shown in the image above, fitting as many instances as feasible within the margin around this hyperplane while at the same time limiting margin violations. In this regard, the LinearSVR is the regression-based equivalent of the LinearSVC class.
For this purpose, we will analyse a simple correlation plot of the features to determine which ones to include in the model.
When looking at a correlation plot for the variables in question, we can see that Outcome (whether the person is diabetic or not), Glucose, and Skin Thickness demonstrate relatively stronger correlations with BMI (which is the outcome variable in this instance).
That said, we see that the Outcome and Glucose variables show a correlation of 0.49. This indicates that the variables may be multicollinear — i.e. they are both explaining the same thing (in this case, that the person is diabetic), and thus including both may be redundant.
In this case, we will include Glucose and Skin Thickness as the two features in modelling BMI.
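The correlation checks described above can be sketched in a few lines. This is a minimal illustration only: the `pearson` helper and the sample values are assumptions made up for the example and are not taken from the diabetes dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (assumption), NOT the article's data.
glucose = [85, 148, 183, 89, 137, 116]
outcome = [0, 1, 1, 0, 1, 0]
bmi = [26.6, 33.6, 23.3, 28.1, 43.1, 25.6]

# Correlation of a candidate feature with the target variable.
r_target = pearson(glucose, bmi)
# Correlation between two candidate features (a multicollinearity check).
r_features = pearson(glucose, outcome)

print(f"glucose vs bmi:     {r_target:.2f}")
print(f"glucose vs outcome: {r_features:.2f}")
```

A high feature-to-feature correlation is what would prompt dropping one of the pair, as was done with Outcome here.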
The variables are loaded and a train-test split is conducted:
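import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

y1 = np.array(bmi)
x1 = np.column_stack((skinthickness, glucose))
x1 = sm.add_constant(x1, prepend=True)
X_train, X_val, y_train, y_val = train_test_split(x1, y1)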
Now, the LinearSVR models are defined with varying epsilon values.
from sklearn.svm import LinearSVR

svm_reg_0 = LinearSVR(epsilon=0)
svm_reg_05 = LinearSVR(epsilon=0.5)
svm_reg_15 = LinearSVR(epsilon=1.5)

svm_reg_0.fit(X_train, y_train)
svm_reg_05.fit(X_train, y_train)
svm_reg_15.fit(X_train, y_train)
Predictions are generated using the validation data:
predictions0 = svm_reg_0.predict(X_val)
predictions05 = svm_reg_05.predict(X_val)
predictions15 = svm_reg_15.predict(X_val)
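The split, fit and predict steps above can also be condensed into a loop over the epsilon values. The following is a sketch under stated assumptions: the data is synthetic (generated with NumPy, not the diabetes dataset), and `random_state` and `max_iter` are set only to make the run reproducible and to give the solver room to converge. Since LinearSVR fits an intercept by default, the sketch does not add a constant column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in data (assumption): two features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 30.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Fixing random_state makes the train/validation split reproducible.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

models = {}
predictions = {}
for eps in (0.0, 0.5, 1.5):
    # One model per epsilon value; every model is fitted before predicting.
    model = LinearSVR(epsilon=eps, random_state=0, max_iter=10000)
    model.fit(X_train, y_train)
    models[eps] = model
    predictions[eps] = model.predict(X_val)
```

Keeping the models in a dictionary keyed by epsilon makes it harder to call `predict` on a model that was never fitted.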
RMSE (root mean squared error) values are generated by comparing the predictions to the validation set.
>>> mean_squared_error(y_val, predictions0)
>>> math.sqrt(mean_squared_error(y_val, predictions0))
6.776059607874521
>>> mean_squared_error(y_val, predictions05)
>>> math.sqrt(mean_squared_error(y_val, predictions05))
8.491111246123179
>>> mean_squared_error(y_val, predictions15)
>>> math.sqrt(mean_squared_error(y_val, predictions15))
5.905569225428098
We can see that the lowest RMSE (5.91) is obtained when the epsilon value is set to 1.5. However, this is only modestly lower than the 6.78 obtained when epsilon is set to 0. According to the sklearn documentation as referenced above, the appropriate value of epsilon depends on the scale of the target variable. In case of any doubt, this should be left at 0.
In this regard, an epsilon of 0 will be used in generating the predictions on the test set and analysing the subsequent RMSE.
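To make the role of epsilon more concrete, here is a small sketch of the epsilon-insensitive loss, which is the default loss used by LinearSVR. The function name and the sample values are illustrative assumptions. Errors smaller than epsilon cost nothing, which is why a sensible epsilon depends on the units of the target variable.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Loss is zero inside the epsilon tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Illustrative values (assumption): a prediction that misses by 1 BMI unit.
y_true, y_pred = 30.0, 31.0

losses = {
    eps: epsilon_insensitive_loss(y_true, y_pred, eps)
    for eps in (0.0, 0.5, 1.5)
}

for eps, loss in losses.items():
    print(f"epsilon={eps}: loss={loss}")
# epsilon=0.0: loss=1.0
# epsilon=0.5: loss=0.5
# epsilon=1.5: loss=0.0
```

With epsilon set to 0, every deviation is penalised, which is the conservative choice when the scale of the target is uncertain.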
A portion of the data was held out from the original dataset used to train the LinearSVR.
The model is now used to generate predictions on the held-out feature data, and these predictions are compared to the unseen BMI values.
atest = np.column_stack((t_skinthickness, t_glucose))
atest = sm.add_constant(atest, prepend=True)
t_bmi = h3data['BMI']
btest = t_bmi
btest = btest.values
bpred = svm_reg_0.predict(atest)
An RMSE value of 7.38 is generated, which is roughly 22% of the test mean of 32.82. For this particular dataset, it is quite unlikely that the features in the dataset (including the ones that were ultimately dropped from the analysis) can account for all the variation in BMI. Lifestyle factors such as daily calorie intake and daily physical activity, among others, would likely have a significant impact on the overall BMI figure.
In this regard, we can judge that the features identified have done a reasonably good job at explaining much of the variation in BMI.
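For reference, the RMSE and its size relative to the mean can be computed as in the sketch below. The `rmse` helper and the short lists are assumptions for illustration only and will not reproduce the 7.38 figure. Only the final ratio uses the numbers reported above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values (assumption): every prediction misses by exactly 2 units.
actual = [30.0, 25.0, 35.0, 28.0]
predicted = [28.0, 27.0, 33.0, 30.0]
example_rmse = rmse(actual, predicted)
print(example_rmse)  # 2.0

# Figures reported in the article: test RMSE of 7.38, test mean of 32.82.
relative_error = 7.38 / 32.82
print(f"{relative_error:.1%}")  # 22.5%
```

Expressing the RMSE as a share of the mean gives a scale-free sense of how large the typical error is.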
In this example, we have seen how a LinearSVR can be used to solve regression problems. Specifically, we saw:
- How correlation plots can aid in feature selection
- Configuration of a LinearSVR and the role of epsilon
- How to test model accuracy using RMSE
As we have seen, while the model performed reasonably well in estimating BMI values, it is also important to account for the fact that certain features that would be of importance in influencing BMI values are not included in the dataset. Therefore, there is a limit to the accuracy that the LinearSVR (or indeed any model) can achieve given the lack of available data.
Many thanks for your time, and any questions or feedback are greatly welcomed.
Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.