
- Introduction
- CatBoost
- Short Tutorial
- Summary
- References
As Data Science competitions become more and more popular, especially on Kaggle, so do new Machine Learning algorithms. Where XGBoost was once the most competitive and accurate algorithm most of the time, a new leader has emerged: CatBoost, an open-source library based on gradient boosted decision trees from the company Yandex [2]. The documentation includes GitHub references and examples, news, benchmarks, feedback, contacts, tutorials, and installation instructions. If you are using XGBoost, LightGBM, or H2O, the CatBoost documentation includes benchmarks showing CatBoost coming out ahead with both tuned and default parameters. And if your data contains many categorical variables, CatBoost is especially worth considering. Keep reading below if you would like to learn more about this library from Yandex.
The main reason I use CatBoost is that it is easy to use, efficient, and works especially well with categorical variables. As the name implies, CatBoost stands for 'categorical' boosting. It is quicker to get started with than, say, XGBoost, because it does not require pre-processing your categorical data, which can take up the largest share of time in a typical Data Science model-building process. Another problem other algorithms run into with categorical variables such as IDs is that dummy variables or one-hot encoding can blow the feature matrix up into thousands of columns that become impractical to compute. CatBoost avoids this problem through the way it transforms its categorical variables, as you will see below.
Training
CatBoost builds gradient boosted decision trees on a training dataset, with accuracy evaluated on a validation dataset. During training, the decision trees are built sequentially, with each successive tree reducing the loss of the ensemble.
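To make this concrete, here is a minimal sketch of a training run with a held-out validation set; the toy data, split, and parameter values are my own illustration, not something taken from the CatBoost docs:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor

# toy data purely for illustration
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# trees are added one after another; eval_set lets CatBoost report loss on held-out data
model = CatBoostRegressor(iterations=300, learning_rate=0.1, verbose=50)
model.fit(X_train, y_train, eval_set=(X_val, y_val))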
Split Calculation
Based on CatBoost's starting parameters, numerical features are quantized, meaning their values are split into buckets, when the algorithm determines the best candidate splits.
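If you want to influence this behavior, the border_count parameter controls how many buckets numerical features are quantized into; the value below is just an illustration, not a recommendation:

from catboost import CatBoostRegressor

# fewer borders (buckets) means faster training; more borders means finer-grained splits
model = CatBoostRegressor(border_count=128)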
Categorical Feature Transformation
The main benefit of this algorithm, I think, is how it handles categorical feature transformation compared to other Machine Learning algorithms.
For example, in classification, a random permutation of the dataset is performed, and then each categorical value is replaced using a formula unique to CatBoost (ordered target encoding):

target_average = (countInClass + prior) / (totalCount + 1)

Here, countInClass is how many times the label was 1 for the preceding objects in the permutation that share the current categorical value, totalCount is the total number of preceding objects with that value, and prior is a constant defined by the starting parameters.
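To make the formula concrete, here is a rough sketch of ordered target encoding for a single categorical column; this is my own illustration of the idea, not CatBoost's internal implementation:

def ordered_target_encode(categories, targets, prior=0.5):
    """Encode each value using only the rows that came before it in the permutation."""
    total_counts, counts_in_class, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        count_in_class = counts_in_class.get(cat, 0)
        total_count = total_counts.get(cat, 0)
        encoded.append((count_in_class + prior) / (total_count + 1))
        total_counts[cat] = total_count + 1
        counts_in_class[cat] = count_in_class + target
    return encoded

# for the third 'cat' row: countInClass = 1, totalCount = 2 -> (1 + 0.5) / (2 + 1) = 0.5
print(ordered_target_encode(['cat', 'cat', 'cat', 'dog'], [1, 0, 1, 1]))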
Feature Importance
This library allows for some awesome visualizations, including charts of the model training and testing process and a print-out of feature importance, to name a few. You can access SHAP values as well, which are becoming a popular way to add explainability to your model features. There are plenty of examples in the docs, and the visualization page [4] they provide is particularly useful.
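For instance, assuming you are in a Jupyter notebook (the interactive chart only renders there) and reusing the toy X_train/X_val split from the earlier sketch, passing plot=True to fit draws the training and validation loss per iteration:

from catboost import CatBoostRegressor

# plot=True renders an interactive chart of the loss on the training and eval sets
model = CatBoostRegressor(iterations=200)
model.fit(X_train, y_train, eval_set=(X_val, y_val), plot=True)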
There are both the CatBoostClassifier and the CatBoostRegressor. I will be discussing the CatBoostRegressor; the classifier code is not much different, other than the type of algorithm you choose based on your target variable.
Here is some code you can use to build your initial CatBoost model — of course, you can tune your parameters as well:
import pandas as pd
from catboost import CatBoostRegressor

# read in your specific data
training_data = pd.read_csv('file_to_training_data.csv')
evaluation_data = pd.read_csv('file_to_evaluation_data.csv')

# separate out your target column ('target' is just a placeholder name here)
training_labels = training_data.pop('target')

# establish, fit, and predict using CatBoostRegressor
cat_features = ['person_type', 'cat_example_2']  # names of your categorical columns
model = CatBoostRegressor()
model.fit(training_data, training_labels, cat_features=cat_features)
preds = model.predict(evaluation_data)
As you can see, in just a few lines of code, you can create your first baseline CatBoost regression model. It is similar to most other Machine Learning algorithms, and the most important piece of code to include is the cat_features parameter: you will have to specify your list of categorical features there.
Here are some of the many useful methods that are a part of this algorithm:
- get_all_params
- get_feature_importance
- load_model
- randomized_search
- save_model
There are a ton more of course, but these are the ones I have personally used with the model, especially get_feature_importance, as it returns a nice summary of your most important features.
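As a rough sketch, reusing the model, training_data, training_labels, and cat_features names from the snippet above, feature importance and SHAP values might be pulled along these lines:

from catboost import Pool

# ranked feature importances as a readable DataFrame
print(model.get_feature_importance(prettified=True))

# per-row SHAP values; the returned array has one extra column holding the base prediction
train_pool = Pool(training_data, training_labels, cat_features=cat_features)
shap_values = model.get_feature_importance(data=train_pool, type='ShapValues')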
As you can see, CatBoost offers some useful benefits with easy implementation. Some of the main features of this competitive library are that the default parameters provide great results even without tuning, categorical features do not need preprocessing, computation is quick, accuracy improves with less overfitting, and predictions are efficient. Yandex researchers have provided an extremely useful library that can be used for many competition use cases, as well as career and production use cases. They have also shown on many popular datasets that their benchmark quality compares favorably with LightGBM, XGBoost, and H2O.
Overall, CatBoost is great at the following:
- fast
- easy to use
- accurate
- use of categorical variables in your feature set
- provides several useful methods
- provides several useful visualizations
- has benchmarked favorably against the previous leading Machine Learning algorithms
I hope you found this article both interesting and useful. Please feel free to comment down below if you have used CatBoost before or prefer something else. Do you agree that it is better and why or why not? I want to thank Yandex for an amazing library and documentation. Thank you for reading, I appreciate it!
If you would like to learn more in-depth about the CatBoost library, there is another article [6] by Mariia Garkavenko that covers all of the parameters, what they mean, and how to tune them.
[1] Photo by Pacto Visual on Unsplash, (2016)
[2] Photo by Fachry Zella Devandra on Unsplash, (2017)
[3] Yandex, CatBoost documentation, (2020)
[4] Yandex, CatBoost visualization, (2020)
[5] Photo by Joshua Aragon on Unsplash, (2019)
[6] Mariia Garkavenko, Categorical features parameters in CatBoost, (2020)