PyCaret is a low-code machine learning library that automates machine learning workflows. It provides a wrapper around popular machine learning libraries such as scikit-learn, XGBoost, LightGBM, CatBoost, and many more.
With PyCaret, we can build a machine learning model for classification, regression, clustering, anomaly detection, or NLP problems in just a few lines of code.
If you haven’t installed PyCaret, you can easily do so by typing the following pip command.
pip install pycaret
Since we’re going to solve a classification problem, we need to use the
pycaret.classification module. If you’re solving a different type of problem, you need to use another module, which you can find on PyCaret’s official documentation page.
Experiment 1: Default Setup
First things first, we need to setup our PyCaret environment with
setup() function. This function needs to be called first before we call other functions in PyCaret.
from pycaret.classification import *
exp_clf01 = setup(data = wine_df, target = 'quality', session_id = 123)
As you can see, we passed three arguments to the setup() function:
data— Our input data.
target— The name of the feature that we want to predict (dependent variable).
session_id— The identifier for our setup environment, used as the random seed for reproducibility.
If you run the code snippet above, you’ll get the following outputs:
From the output above, you can see that the
setup() function automatically splits our data into a train set and a test set.
Also, it will automatically infer the data types of your features: whether a feature is numerical or categorical. You need to look at the output carefully because there are times when the function infers the data types incorrectly. If you find that one of the features is inferred incorrectly, you can correct it by doing the following:
exp_clf01 = setup(data = wine_df, target = 'quality', session_id = 123, categorical_features = ['feature1', 'feature2'], numerical_features = ['feature3', 'feature4'])
You can use the
categorical_features and numerical_features parameters to correct the data types that are incorrectly inferred by the
setup() function. You need to pass a list of strings containing the names of the features that you want to change.
Next, let’s build our classifier model.
When we want to build a machine learning model, most of the time we don’t know in advance which model will give us the best performance according to our metrics. With PyCaret, you’re able to compare the performance of different kinds of classification models with literally a single line of code.
best = compare_models()
As you can see, it turns out that the Random Forest classifier gives us the best performance in 5 out of 7 metrics. Let’s say we want to use the F1 score as the metric for our wine classifier; in that case, the Random Forest classifier also gives us the best performance.
Experiment 2: Tuned Setup
Before we go further, let’s see whether we can improve the performance of the models by tuning our setup:
exp_clf102 = setup(data = wine_df, target = 'quality', session_id = 123, normalize = True, transformation = True)
As you can see, we passed several additional parameters there to tune our setup:
normalize— To transform our features by scaling them to a given range.
transformation— To transform our features so that our data can be approximated by a normal distribution. This can be helpful for models like Logistic Regression, LDA, or Gaussian Naive Bayes.
There are a lot of tuning options available inside this
setup() function. You can learn more about them here.
Also note that we use the same
session_id as our previous
setup() function. This is to make sure that any future improvement in the models is solely due to the changes that we’ve implemented in this new setup.
Let’s compare the models once again with our new setup.
As you can see, most of the metrics improved slightly after we tuned the setup. Before we tuned the setup, the F1 score of the Extra Trees classifier was 0.8306. After we tuned the setup, the F1 score becomes 0.8375.
Based on this result, let’s build our Extra Trees classifier. We can do this with a single line of code.
et_model = create_model('et')
Next, you can evaluate your model by visualizing the ROC curve, feature importance, or confusion matrix, also with a single line of code.
As a final check, we can use our Extra Trees classifier to predict the test data that has been generated by PyCaret. As mentioned earlier, as soon as we executed the
setup() function at the very first step, PyCaret automatically split our data into training data and test data. All of the model performance and evaluation metrics that we’ve seen above are based solely on the training data.
To use the model to predict the test data, we can use the predict_model() function.
Finally, let’s save our Extra Trees classifier model.
save_model(et_model, model_name = 'extra_tree_model')
And that’s it for model building with PyCaret. After this, you should have a pickle file called ‘extra_tree_model.pkl’ in your working directory. We will use this saved model to build a wine classifier web app with Streamlit.