After validating that Random Forest is an appropriate model, it is time to tune its hyperparameters for maximum performance. We will use GridSearchCV from sklearn, which is simple to understand: it tries every combination of the hyperparameters given in param_grid and measures model performance for each combination using K-fold cross-validation. In our case there will be (3*3*6)*cv = 162 training runs, that is, 54 combinations with cv = 3 folds each.
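To make the count above concrete, here is a minimal sketch of such a grid search. The specific values in param_grid are hypothetical (only the 3 x 3 x 6 shape and the winning values come from this article), the data is synthetic, and the helper name run_grid_search is an illustration rather than the original code. The actual fit uses a reduced grid so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid with the same shape as in the text: 3 * 3 * 6 = 54 combinations.
param_grid = {
    "max_depth": [10, 50, 100],
    "max_features": [4, 8, 12],
    "n_estimators": [50, 100, 150, 200, 250, 300],
}
cv = 3  # number of folds, implied by 54 * cv = 162

n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * cv
print(n_combinations, n_fits)


def run_grid_search(X, y, grid, cv=3):
    """Fit one forest per (combination, fold) pair and keep the best combination."""
    search = GridSearchCV(
        estimator=RandomForestRegressor(random_state=0),
        param_grid=grid,
        cv=cv,
        scoring="neg_mean_absolute_error",  # GridSearchCV maximizes, so MAE is negated
    )
    search.fit(X, y)
    return search


# Synthetic stand-in data (assumption: the real dataset is not shown here).
X_demo, y_demo = make_regression(n_samples=200, n_features=12, noise=10.0, random_state=0)

# Reduced grid so the demonstration finishes in seconds.
small_grid = {"max_depth": [5, 10], "max_features": [4, 12], "n_estimators": [20]}
search = run_grid_search(X_demo, y_demo, small_grid, cv=cv)
print(search.best_params_)
```

Passing the full param_grid to run_grid_search instead of small_grid would trigger the 162 cross-validation fits, plus one final refit of the best combination on all the training data.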
The best_estimator_ attribute returned a model with max_depth=100, max_features=12, n_estimators=300, which we can use to calculate the train and test MAE, outputting
train_MAE = 3.55
test_MAE = 9.69
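As a rough sketch of how such numbers can be produced, the snippet below computes train and test MAE with sklearn's mean_absolute_error. It uses synthetic data and a small forest as stand-ins (both are assumptions), so its output will not match the values above. In the real workflow, model would be the estimator returned by best_estimator_.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the article's dataset is not shown here).
X, y = make_regression(n_samples=300, n_features=12, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in for the tuned model; in practice this would be search.best_estimator_.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# MAE is the mean of |y_true - y_pred|, in the same units as the target.
train_MAE = mean_absolute_error(y_train, model.predict(X_train))
test_MAE = mean_absolute_error(y_test, model.predict(X_test))

print(f"train_MAE = {train_MAE:.2f}")
print(f"test_MAE = {test_MAE:.2f}")
```

A train MAE well below the test MAE, as in the results above, is typical for Random Forests, because fully grown trees fit their training data very closely.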
test_MAE decreased by 5.4%, which is pretty good, but we need to take into consideration that this model uses 300 decision trees compared to 100, so it requires more computation.
Since we do not need to consider computation time for training, let’s use the Random Forest from above to look at feature importance.
The importance of each feature is calculated by averaging the amount of impurity that feature removes at each node where it is used, across all trees in the Random Forest. For example, if the PM10 feature was used in 10 nodes and reduced impurity by [1,2,3,4,…10] at those nodes, then its feature importance would be (1+2+3+..+10)/10 = 5.5. Every feature's importance is calculated this way after training and then scaled so that the sum of all importances equals 1. This is a simplified picture: scikit-learn also weights each node's impurity reduction by the fraction of training samples that reach that node.
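The arithmetic in the PM10 example and the scaling step can be sketched as follows. The names feature_b and feature_c and their raw scores are made up purely for illustration, and the second part uses synthetic data only to check that sklearn's feature_importances_ is normalized to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy example from the text: PM10 is used in 10 nodes with reductions 1..10.
pm10_reductions = np.arange(1, 11)
pm10_raw = pm10_reductions.mean()  # (1 + 2 + ... + 10) / 10

# Hypothetical raw scores for two other features, for illustration only.
raw = {"PM10": pm10_raw, "feature_b": 3.0, "feature_c": 1.5}

# Scale so that all importances sum to 1.
total = sum(raw.values())
scaled = {name: value / total for name, value in raw.items()}
print(scaled)

# Check on a fitted forest: the importances are already normalized.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
importances = forest.feature_importances_
print(importances, importances.sum())
```

Because of the scaling, an importance value only has meaning relative to the other features in the same model. It is not an absolute measure of impurity reduction.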
Since feature importances are calculated from how much each feature decreases impurity on average in the trained model, that is, on the training data, feature_importances_ becomes less trustworthy if the model performs poorly on the test set.