There are several ways to express Feature Importance. It can be a score representing each feature’s relevancy (algorithm-based Feature Importance, such as the scores from tree-based models like Random Forest or XGBoost), a dimension reduction that compresses the information in the original features into a lower dimension (like an Autoencoder or PCA), or domain knowledge selection (aka selection based on your own knowledge). With this variety of choices, which one should we choose? Even once we settle on PCA, selecting the right number of components to keep takes real effort. So should we go with the Feature Importance score, or test everything and see which performs best?
Yes, that’s right: testing everything is the accurate answer. From my experience, I usually set the baseline as the original model without any selection, then try PCA while dropping different proportions of the components and check the metric. After that, I drop original features (mainly because I want to limit the amount of data that has to be collected and reduce the training time while maintaining or improving the performance). So Dimension Reduction is my first choice, and Feature Importance is the next essential step. But using only the Feature Importance score is hardly a good choice. Let me show you why.
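To make that workflow concrete, here is a minimal sketch of comparing a baseline against PCA with different numbers of retained components. It is only an illustration under assumptions: the synthetic data from `make_regression`, the Random Forest model, the candidate component counts, and the helper name `mean_r2` are all my own choices for the example, not the exact setup I use in practice.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch); swap in your own X, y.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=0
)


def mean_r2(estimator):
    """Hypothetical helper: mean 3-fold cross-validated R^2 of an estimator."""
    return cross_val_score(estimator, X, y, cv=3, scoring="r2").mean()


# Baseline: the model trained on all original features, no selection.
results = {
    "baseline": mean_r2(RandomForestRegressor(n_estimators=30, random_state=0))
}

# PCA variants: scale, keep only the first n_components, then fit the same model.
for n_components in (2, 4, 6):
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        RandomForestRegressor(n_estimators=30, random_state=0),
    )
    results[f"pca_{n_components}"] = mean_r2(pipeline)

for name, score in results.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the comparison with the baseline is not contaminated by the held-out fold.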
Feature Importance score here means any score generated from the trained model that represents the weight or relevancy of each feature to the prediction of the target.
Below are the feature importance scores from a Random Forest, computed with a RandomForestRegressor model (detail of formula is here), the features selected with SelectFromModel (detail), and the permutation scores (detail of formula is here). The data used for demonstration is the California housing dataset in sklearn.
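For readers who want to reproduce this kind of comparison, the sketch below shows how the three quantities can be obtained in scikit-learn. To keep it runnable offline it uses synthetic stand-in data with made-up feature names (`f0` to `f7`), which is an assumption of the example; `sklearn.datasets.fetch_california_housing` can be swapped in to use the real dataset. The hyperparameters are illustrative, and the numbers it prints are not the ones discussed in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with fetch_california_housing().
X, y = make_regression(
    n_samples=400, n_features=8, n_informative=4, noise=10.0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Impurity-based importance stored on the fitted forest.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# 2) SelectFromModel keeps features whose importance reaches a threshold
#    (for a forest, the default threshold is the mean importance).
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(X_train, y_train)
selected = selector.get_support()

# 3) Permutation importance: drop in score on held-out data when one
#    feature's values are shuffled, averaged over several repeats.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)

for i, name in enumerate(feature_names):
    print(
        f"{name}: impurity={rf.feature_importances_[i]:.3f} "
        f"selected={bool(selected[i])} "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```

Note that the impurity-based scores come from the training data, while the permutation scores here are measured on the held-out split, so the two rankings do not have to agree.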