## A Step-by-Step Guide

If you've always wanted to learn decision trees, then just by reading this you've received a beautiful opportunity, a stroke of luck, I might say. But as entrepreneurs say, "it's not enough to be in the right place; you also need to take the opportunity when it comes." So consider giving yourself some time and get comfortable, because what's coming is not the shortest guide, but it might be the best.

Decision trees are the **foundation** of the best-performing algorithms in Kaggle competitions and in real life. One indicator of this is that you are almost certain to run into a `max_depth` parameter in nearly every ensemble. That's precisely why the story you are about to read uses the most relevant hyperparameters to build its narrative.

I never really learned decision trees until I decided to start looking at every hyperparameter. After studying them in detail, all the fog in my mind slowly vanished. I wrote this thinking of the guide I wish I had when I started looking for information, so it's intended to be very easy to follow.

Don't get me wrong. I don't want to come across as a smug know-it-all repeating the typical phrase "you should know all the foundations before doing anything," because I don't really believe in it.

**The learning process is not linear**, and if you want to train your first model, maybe it's better to just know what a random forest is and go for it. It's a significant step! But after training your first model, I'm sure you will realize that some problems arise.

You will have neither the compute infrastructure nor the time to try every hyperparameter combination. Understanding what’s happening behind the scenes when you tweak them will give you a smarter starting point.

I can assure you of this, and remember it: the next time you define your hyperparameter range, you will start hypothesizing about the best combination and checking which one worked best. That exercise will **improve your skills** far more than a blind grid search.

For this guide, you **don't need any prior knowledge** of decision trees. That's the magic. The algorithm will be built up from its hyperparameters, so we can kill two birds with one stone: learn how decision trees work and how the hyperparameters modify their behavior. I really encourage you to read every hyperparameter, because we will keep building on the algorithm's underlying mechanics as we review them.

Just for the sake of the conversation, I want to clarify that we will use the parameters of the **scikit-learn implementation** for classification. The idea is to mix intuition and math to create the perfect learning combination for you.

Note: at the time of writing, there are 12 non-deprecated parameters in the scikit-learn documentation. Since there are so many, we will focus on just 6; if you support this, I will certainly come back with the others. Also, the words "hyperparameter" and "parameter" will be used interchangeably.

Decision trees are implemented in scikit-learn using an optimized version of the CART algorithm. The procedure asks questions about the data to split it into **two** groups. This is very important because other implementations can split the data into more than two groups at once.

This happens over and over until one of our beloved parameters stops the process or until the data cannot be split anymore (a single sample). Before the first question, the place where all the data lives is called the "root". After that, every subset of samples is a "node" if there is a split after it. If a node can't be split any further, we call it a "leaf".
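A quick way to see the root, the binary splits, and the leaves for yourself is to train a tiny tree and print it. This is just a sketch, using the classic iris dataset (which is not part of this article's example) and scikit-learn's `export_text` helper:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the printed structure stays readable
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every split asks one binary question ("feature <= threshold"),
# so each node has exactly two children; lines ending in "class: ..."
# are the leaves, where no further split happens
print(export_text(tree, feature_names=load_iris().feature_names))
```

The first question printed at the top is the root, and each level of indentation is one split deeper into the tree.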

In the image, you can see the overall structure, with each illustration and title to the left of its corresponding boxes. If you don't understand what's happening here, don't hesitate to keep reading; everything will be clear in a few minutes.

Ok, that's a pretty general approximation, but you might have some fair questions, like: "How do we decide which variable to split on?", "How do we choose the threshold for the split?", "Can the same feature be used again?" or "Is the depth related to the number of splits?", among many others.

The goal of what you are about to read is to answer those questions and many more. With every (hyper)parameter, you will discover new clues that will help you actually **see** how trees grow. And if someday someone gives you some data, a pencil, and a piece of paper, you will be well equipped to solve the problem by yourself (super useful!).

Are you ready? Let’s do it!

The first parameter, `criterion`, isn't a "game-changer," but it's pretty useful for illustrating the basic mechanics of decision trees. As the old saying goes, "let's start at the beginning."

As we saw, at every iteration we have to determine which feature we will use to ask the question of our data. Because of that, we need a method to rank the different options.

We will review the two functions available to measure the quality of a split: **Gini** and **entropy**.

Gini: a great way to understand it is by checking Wikipedia's definition: "Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset."

To better explain this concept, we will use the most original example I could think of: balls. Imagine a case with 5 balls; we know their weight, they can be big or small, and they are either red or green.

The goal is to predict the color of the balls based on their weight and size. Now is when we should start using our Gini measurement to choose whether we will split our data based on the balls' weight or their size.
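To make "pick the split with the lower impurity" concrete, here is a hedged sketch that peeks ahead at the closed-form Gini impurity (1 minus the sum of squared class proportions). The two candidate splits below use made-up ball data, invented purely for illustration:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Impurity of a split: each child's impurity weighted by its share of samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical data: 3 red and 2 green balls.
# Candidate split A (by size):  big -> [red, red, green], small -> [red, green]
# Candidate split B (by weight): light -> [red, red], heavy -> [red, green, green]
split_a = weighted_gini(["red", "red", "green"], ["red", "green"])
split_b = weighted_gini(["red", "red"], ["red", "green", "green"])
print(split_a, split_b)  # the split with the lower weighted impurity wins
```

In this invented case, split B produces one perfectly pure child (two red balls), so its weighted impurity is lower and the tree would split on weight.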

Before the mathematical formulation, let’s try to figure out what this measurement is trying to do. Let’s replicate the Gini impurity definition for this particular case in our “root node”. First, we need to randomly select an element, and after that, we need to label it following the original distribution.

In this case, we selected the big green 4-kilo ball, and we should label it as red with a probability of 0.6 and as green with the remaining 0.4. The problem is that we can't calculate anything with just one example: the Gini impurity aims to determine how often, over this whole procedure, we would end up with an incorrectly labeled sample.

A possible solution is to simulate this process, repeating it enough times to get a good estimate. We can try that with the following code.
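Here is one minimal way such a simulation could look (a sketch, assuming the 3-red / 2-green distribution from the example above): draw a random ball, then relabel it according to the same label distribution, and count how often the label is wrong.

```python
import random

# Ball colors from the example: 3 red, 2 green (p_red = 0.6, p_green = 0.4)
labels = ["red", "red", "red", "green", "green"]

def simulated_gini(labels, trials=100_000, seed=42):
    """Estimate the Gini impurity by Monte Carlo simulation:
    pick a random element, assign it a label drawn from the label
    distribution, and measure how often the assigned label is wrong."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(trials):
        true_label = rng.choice(labels)     # the randomly chosen element
        guessed_label = rng.choice(labels)  # a label drawn from the same distribution
        if true_label != guessed_label:
            mistakes += 1
    return mistakes / trials

print(simulated_gini(labels))  # should hover around 1 - (0.6**2 + 0.4**2) = 0.48
```

With enough trials, the estimate converges to the closed-form value, which is what the exact Gini formula computes directly.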