
The objective of this tutorial is to provide hands-on experience with CatBoost regression in Python. In this simple exercise, we will use the Boston Housing dataset to predict Boston house prices, but the same logic applies to more complex datasets.
So let’s get started.
First, we need to import the required libraries along with the dataset:
import catboost as cb
import numpy as np
import pandas as pd
import seaborn as sns
import shap
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

boston = load_boston()
boston = pd.DataFrame(boston.data, columns=boston.feature_names)
Data exploration
It is always good practice to check for missing (NaN) values in your dataset, as they can confuse or, at worst, hurt the performance of the algorithm.
boston.isnull().sum()
However, this dataset does not contain any missing values.
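If missing values had been present, a minimal imputation step could look like the sketch below. Filling with the column medians is just one option, and CatBoost can also handle missing numeric values natively:
# Hypothetical: fill missing values with each column's median
# (not needed here, since the Boston data contains no missing values)
boston = boston.fillna(boston.median())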
The data exploration and feature engineering phases are among the most crucial (and time-consuming) parts of a data science project. In this tutorial, however, the main emphasis is on introducing the CatBoost algorithm. Hence, if you want to dive deeper into the descriptive analysis, please visit EDA & Boston House Cost Prediction [4].
Training
Next, we need to split our data into an 80% training set and a 20% test set.
The target variable is ‘MEDV’ — Median value of owner-occupied homes in $1000’s.
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
To train and optimize our model, we use the CatBoost library's built-in tool for combining features and target variables into train and test datasets. This pooling lets you specify the target variable, the predictors, and the list of categorical features, and the Pool constructor combines those inputs and passes them to the model.
train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)
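The Boston data is purely numeric, so no categorical features are declared above. For a dataset that does contain categorical columns, their indices would be passed via the cat_features argument; the indices below are purely illustrative:
# Hypothetical example: columns 0 and 3 declared as categorical
# (illustrative only — the Boston data has no true categorical features)
train_pool = cb.Pool(X_train, y_train, cat_features=[0, 3])
test_pool = cb.Pool(X_test, y_test, cat_features=[0, 3])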
Next, we will introduce our model.
model = cb.CatBoostRegressor(loss_function='RMSE')
We will use the RMSE measure as our loss function because it is a regression task.
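For reference, RMSE is simply the square root of the mean squared prediction error. Computed by hand it looks like the sketch below (illustrative only; CatBoost evaluates it internally during training):
# Illustrative only: RMSE computed by hand
def rmse_by_hand(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))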
When an algorithm is tailored to a specific task, it often benefits from parameter tuning. The CatBoost library offers a flexible interface for its built-in grid search, and if you already know scikit-learn's grid search functionality, this procedure will feel familiar.
In this tutorial, only the most common parameters are included: the number of iterations, the learning rate, the L2 leaf regularization, and the tree depth. If you want to discover more hyperparameter tuning possibilities, check out the CatBoost documentation here.
grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

model.grid_search(grid, train_dataset)
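grid_search returns a dictionary holding the best parameter combination under the 'params' key (and the cross-validation history under 'cv_results'). Capturing the return value, as sketched below, lets you inspect what was selected:
# Capture the search results to inspect the selected hyperparameters
grid_result = model.grid_search(grid, train_dataset)
print(grid_result['params'])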
Performance evaluation
We have now trained our model and can finally proceed to evaluating it on the test data.
Let’s see how the model performs.
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)

print("Testing performance")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))
As depicted above, we achieve an R-squared of 90% on our test set, which is quite good considering the minimal feature engineering.