
Yet another quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average.
More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.
Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable.
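To make this weighted-average computation concrete, here is a minimal sketch that recomputes the importance scores of a single DecisionTreeClassifier by hand, using the low-level tree_ structure that Scikit-Learn exposes after training (children_left, impurity, weighted_n_node_samples, and so on). It is only an illustration of the idea described above, not how you would normally access the scores; the result should match the tree's feature_importances_ up to floating-point error.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(iris["data"], iris["target"])

t = tree_clf.tree_  # low-level tree structure
importances = np.zeros(iris["data"].shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so it contributes nothing
        continue
    # impurity decrease at this node, weighted by the number of training
    # samples that reach it (and its two children)
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease

importances /= importances.sum()  # scale so the importances sum to 1
print(importances)
print(tree_clf.feature_importances_)  # should match, up to rounding

A Random Forest then averages these normalized per-tree scores across all its trees, which is what the forest's feature_importances_ attribute reports.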
For instance, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature's importance. It seems that the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
Similarly, if you train a Random Forest classifier on the MNIST dataset and plot each pixel's importance, you can see at a glance which pixels the model actually relies on. Random Forests are very handy for getting a quick understanding of what features actually matter, in particular if you need to perform feature selection.
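If you want to try that experiment yourself, a minimal sketch could look like the following. The parameter choices here (the "mnist_784" OpenML dataset fetched with fetch_openml, 100 trees, a fixed random_state) are illustrative rather than prescribed, and training on the full 70,000-image dataset may take a minute or two.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# MNIST: 70,000 images of 28x28 = 784 pixels, each pixel being one feature
mnist = fetch_openml("mnist_784", as_frame=False)
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(mnist.data, mnist.target)

# Reshape the 784 importance scores back into a 28x28 image and plot it
importances_image = rnd_clf.feature_importances_.reshape(28, 28)
plt.imshow(importances_image, cmap="hot")
plt.axis("off")
plt.colorbar(label="pixel importance")
plt.show()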