Model Selection Techniques: Parsimony & Goodness of Fit

Introduction

By Definition:

A parsimonious model is a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible.

The goodness of fit of a statistical model describes how well it fits a set of observations.

Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.

The idea behind parsimonious models stems from Occam’s razor, or “the law of briefness” (sometimes called lex parsimoniae in Latin). The law states that you should use no more “things” than necessary; in the case of parsimonious models, those “things” are parameters. Parsimonious models have optimal parsimony, or just the right number of predictors needed to explain the data well.

There are generally two ways of evaluating a model: based on predictions and based on goodness of fit on the current data. In the first case, we want to know whether our model adequately predicts new data; in the second, we want to know whether our model adequately describes the relations in our current data. These are two different things.

Comparing the Models

There is generally a trade-off between goodness of fit and parsimony: low parsimony models (i.e. models with many parameters) tend to have a better fit than high parsimony models. This is not usually a good thing: adding more parameters usually improves the fit to the data at hand, but the resulting model overfits and will likely be useless for predicting other data sets. Finding the right balance between parsimony and goodness of fit can be challenging.

Model Selection Approaches

Model selection can follow three approaches:

Evaluating based on predictions:

The best way to evaluate models used for prediction is cross-validation. Very briefly, we cut our dataset into, say, 10 pieces, use 9 of them to build the model, and predict the outcomes for the 10th. The mean squared difference between the observed and predicted values gives us a measure of prediction accuracy. Repeating this 10 times, so that each piece is held out once, we average the mean squared difference over all 10 iterations to obtain a general value with a standard deviation. This allows us to compare two models on their prediction accuracy using standard statistical techniques (t-test or ANOVA).
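The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data are simulated, and the two candidate models (an intercept-only model and a straight line), along with the helper names fit_mean, fit_line and k_fold_mse, are assumptions made for the example.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Intercept-only model: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def k_fold_mse(xs, ys, fit, k=10, seed=0):
    """Return the mean squared prediction error of each held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal pieces
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores

# Hypothetical data: y depends linearly on x, plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 1.5 * x + rng.gauss(0, 1) for x in xs]

# Same seed for both models, so they are scored on identical folds.
mse_mean = k_fold_mse(xs, ys, fit_mean)
mse_line = k_fold_mse(xs, ys, fit_line)
print("mean-only:", statistics.fmean(mse_mean), "+/-", statistics.stdev(mse_mean))
print("line     :", statistics.fmean(mse_line), "+/-", statistics.stdev(mse_line))
```

Because both models are scored on the same folds, the ten per-fold errors form natural pairs, which is what a paired comparison of the two models would use.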

Evaluating based on goodness of fit:

This approach differs depending on the model framework we use. For example, a likelihood-ratio test can work for Generalized Additive Mixed Models when the errors follow the classic Gaussian distribution, but is meaningless in the case of the binomial variant.

More intuitive ways of comparing the goodness of fit of two models include the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Other methods, such as Mallows’ Cₚ criterion, Bayes Factors, and Minimum Description Length (MDL), are also popular.

Let’s explore some of these methods:

Akaike Information Criterion:

Akaike’s information criterion (AIC) compares the quality of a set of statistical models to each other. If we have a number of models to compare, the AIC scores each model so that the models can be ranked from best to worst, with the lowest AIC being the best. The best model will be the one that neither over-fits nor under-fits. The basic formula for the AIC is:

$$\mathrm{AIC} = -2\,(\text{log-likelihood}) + 2K$$

Where:

K is the number of model parameters (the number of variables in the model plus the intercept).
Log-likelihood is a measure of model fit. The higher the number, the better the fit. This is usually obtained from statistical output.

For small sample sizes (n/K less than about 40), use the second-order AIC:

$$\mathrm{AIC}_c = -2\,(\text{log-likelihood}) + 2K + \frac{2K(K+1)}{n-K-1}$$

Where:

n = sample size,
K = number of model parameters,
Log-likelihood is a measure of model fit.
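As a small worked example, the two formulas can be computed directly. The log-likelihood values, parameter counts, and sample size below are made up for illustration; in practice they would come from the statistical output of the fitted models.

```python
def aic(log_likelihood, K):
    """AIC = -2 * log-likelihood + 2K."""
    return -2.0 * log_likelihood + 2.0 * K

def aicc(log_likelihood, K, n):
    """Second-order AIC: adds a small-sample correction term to the AIC."""
    if n - K - 1 <= 0:
        raise ValueError("AICc requires n > K + 1")
    return aic(log_likelihood, K) + (2.0 * K * (K + 1)) / (n - K - 1)

# Hypothetical fits of two models to the same n = 30 observations.
n = 30
small = {"loglik": -45.0, "K": 3}
large = {"loglik": -43.5, "K": 6}  # fits slightly better, uses twice the parameters

for name, m in (("small", small), ("large", large)):
    print(name, "AIC =", aic(m["loglik"], m["K"]),
          "AICc =", round(aicc(m["loglik"], m["K"], n), 3))
```

Here n/K is well under 40 for both models, so the corrected values are the ones to compare. The larger model has the higher log-likelihood, but not by enough to pay for its three extra parameters, so the smaller model has the lower AIC and AICc.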

Bayesian Information Criterion:

BIC is almost the same as the AIC, although it tends to favour models with fewer parameters. The BIC is also known as the Schwarz information criterion or Schwarz’s BIC. The basic formula for BIC is:

$$\mathrm{BIC} = k\ln(n) - 2\ln\bigl(L(\hat{\theta})\bigr)$$

Here n is the sample size: the number of observations or data points you are working with. k is the number of parameters which your model estimates, and θ is the set of all parameters. L(θ̂) represents the likelihood of the model tested, given your data, when evaluated at the maximum likelihood values of θ. In other words, it is the likelihood of the model with its parameters set to their most favourable values.

Given any two estimated models, the model with the lower value of BIC is the one to be preferred. Both unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. Hence, lower BIC implies either fewer explanatory variables, better fit, or both. The BIC generally penalizes free parameters more strongly than the Akaike information criterion does: its penalty is ln(n) per parameter rather than 2, which is larger whenever n is at least 8. It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all estimates being compared. The models being compared need not be nested, unlike the case when models are being compared using an F or likelihood ratio test.
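The difference between the two penalties is easy to check numerically. The sketch below assumes the standard forms BIC = k·ln(n) − 2·ln(L) and AIC = 2k − 2·ln(L); the log-likelihood and sample sizes are illustrative values only.

```python
import math

def bic(log_likelihood, k, n):
    """BIC = k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2.0 * log_likelihood

def aic(log_likelihood, k):
    """AIC = 2k - 2 * log-likelihood."""
    return 2.0 * k - 2.0 * log_likelihood

# Per-parameter penalty: ln(n) for the BIC versus a constant 2 for the AIC.
for n in (5, 7, 8, 100, 10_000):
    print(f"n = {n:>6}: BIC penalty per parameter = {math.log(n):.3f}, AIC = 2")

# Same hypothetical fit scored both ways (log-likelihood -45, k = 3, n = 30).
print("AIC:", aic(-45.0, 3), "BIC:", round(bic(-45.0, 3, 30), 3))
```

The BIC penalty overtakes the AIC penalty once ln(n) exceeds 2, that is from n = 8 onward, and it keeps growing with the sample size, which is why the BIC leans toward smaller models on large data sets.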

Mallows’ Cₚ criterion:

Mallows’ Cₚ Criterion is a way to assess the fit of a multiple regression model. The technique compares the full model with a smaller model with “p” parameters and determines how much error is left unexplained by the partial model. Or, more specifically, it estimates the standardized total mean square of estimation for the partial model with the formula:

$$C_p = \frac{SS(Res)_p}{s^2} - n + 2p$$

Where:

SS(Res)ₚ = residual sum of squares from a model with a set of p-1 explanatory variables, plus an intercept (a constant),
s² = estimate of σ²,
n = sample size

Smaller Cₚ values are better, as they indicate smaller amounts of unexplained error. Look for models that have a small Cₚ and a Cₚ close to p. Alternatively, we may want to choose the smallest model for which Cₚ ≤ p is true.
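A short numerical sketch may help. It assumes the usual form Cₚ = SS(Res)ₚ / s² − n + 2p, and it assumes that s² is taken from the residual mean square of the full model (a common choice, not something stated above). All sums of squares are invented for the example.

```python
def mallows_cp(ss_res_p, s2, n, p):
    """Mallows' Cp for a sub-model with p parameters (intercept included)."""
    return ss_res_p / s2 - n + 2 * p

# Hypothetical setting: n = 50 observations, full model with 6 parameters.
n = 50
ss_res_full, p_full = 88.0, 6
s2 = ss_res_full / (n - p_full)  # residual mean square of the full model = 2.0

candidates = {
    "full (p=6)": (88.0, 6),
    "medium (p=3)": (96.0, 3),
    "tiny (p=2)": (130.0, 2),
}
for name, (ss_res, p) in candidates.items():
    print(f"{name}: Cp = {mallows_cp(ss_res, s2, n, p)}")
```

With s² estimated this way the full model always has Cₚ equal to its own p, so it cannot be used to judge itself. The medium model has a Cₚ of 4 against p = 3, which is small and close to p, while the tiny model's Cₚ of 19 is far above its p = 2, a sign that it leaves out something important.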

Bayes Factors:

The Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the uncertainty surrounding all unknowns. After observing the data, the posterior distribution provides a coherent post-data summary of the remaining uncertainty which is relevant for model selection. However, the practical implementation of this approach often requires carefully tailored priors and novel posterior calculation methods. According to Bayes’ theorem, any model’s posterior probability can be written as:

$$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$$

Here, P(M|D) is the posterior probability of model M given the data D, P(D|M) is the evidence for the model M, P(M) is the prior knowledge about the model M, and P(D) is a normalization factor. When we have two competing models M₁ and M₂, we can compare their posterior probabilities as:

$$\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}$$

The first ratio on the right-hand side is the Bayes factor. With this equation, we can compare two models and take the one with the larger model evidence (when the prior is uninformative, i.e. P(M₁) = P(M₂)). It is similar to the Likelihood Ratio Test, but the models do not have to be nested. Model selection based on Bayes Factors can be approximately equal to BIC model selection. However, BIC doesn’t require knowledge of priors, so it is often preferred.
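To make the comparison concrete, here is a toy example that is not from the text above: k heads are observed in n coin flips, and a "fair coin" model (heads probability fixed at 0.5) is compared with a model that puts a uniform prior on the heads probability. Both evidences have closed forms, so no sampling is needed. The function names are hypothetical.

```python
from math import comb

def evidence_fair(k, n):
    """P(D | M1): probability of k heads in n flips when p is fixed at 0.5."""
    return comb(n, k) * 0.5 ** n

def evidence_uniform(k, n):
    """P(D | M2): binomial likelihood averaged over a Uniform(0, 1) prior on p.

    The integral of C(n, k) * p**k * (1 - p)**(n - k) over [0, 1] equals
    1 / (n + 1), whatever the value of k.
    """
    return 1.0 / (n + 1)

def posterior_odds(k, n, prior_odds=1.0):
    """Posterior odds of M1 over M2 = Bayes factor * prior odds."""
    bayes_factor = evidence_fair(k, n) / evidence_uniform(k, n)
    return bayes_factor * prior_odds

print("10 heads in 20 flips:", round(posterior_odds(10, 20), 3))
print("18 heads in 20 flips:", round(posterior_odds(18, 20), 5))
```

With equal prior odds, an even split of heads and tails favours the simpler fair-coin model (odds above 1), while a lopsided result favours the flexible model (odds well below 1). The two models are compared without either being nested in a formal test.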

Automatic Model Selection:

When we are interested in prediction, we really have two goals for our regression model: 1) Accuracy: the larger the R², the more accurate our y’ values will be; and 2) Efficiency: we don’t want any unnecessary (and perhaps expensive) predictors in the model. To meet these two (somewhat contradictory) goals, we need to identify a set of predictors with two attributes: all the predictors are related to the criterion variable, and the predictors are not strongly related to each other (called “reduced collinearity”).

Over the years, there have been three commonly used procedures for selecting a regression model with these characteristics from a larger set of predictors.

Forward Inclusion: Start with the predictor that has the highest simple correlation with the criterion. On each successive step, add the variable that will produce the largest increase in R² (the one with the largest partial correlation), stopping when an additional predictor will not increase R² significantly.
Backwards Deletion: Start with the full model. On successive steps, delete the predictor that contributes the least to the model (the one whose regression weight has the largest, i.e. least significant, p-value), stopping when deleting the next variable would produce a significant drop in R² (when all the variables in the model contribute).
Forward Stepwise Selection: Think of this one as a combination of forward and backward. Start with the predictor that has the highest simple correlation. For the second step, add the variable that will increase R² the most (the one with the largest partial correlation, but only if the R² increase is significant). Each successive step has two parts: 1) if any predictor in the model is not contributing, toss it (if more than one, toss the one contributing the least, the one with the largest p-value); 2) if all variables in the model are contributing, then add the variable that will produce the largest increase in R² (the one with the largest partial correlation, but only if the R² change will be significant). Stop when all the variables in the model are contributing and there is no additional predictor that will increase R² significantly.
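The first of these procedures, forward inclusion, can be sketched as follows. This is an illustrative toy, not a recommended tool. Two things are assumptions of the sketch: "significant" is approximated by a fixed partial-F threshold (f_to_enter = 4.0, a traditional rule of thumb) rather than an exact p-value, and the data and helper names are made up. Adding the variable with the largest R² increase is the same as adding the one that lowers the residual sum of squares the most, which is what the code does.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def forward_select(predictors, y, f_to_enter=4.0):
    """predictors: dict of name -> values. Returns names in order of entry."""
    n = len(y)
    chosen = []
    current = rss([], y)  # intercept-only model
    while len(chosen) < len(predictors):
        # Candidate that lowers the RSS (i.e. raises R-squared) the most.
        trials = {name: rss([predictors[c] for c in chosen + [name]], y)
                  for name in predictors if name not in chosen}
        name = min(trials, key=trials.get)
        new = trials[name]
        df_resid = n - (len(chosen) + 2)  # intercept + chosen + candidate
        f_stat = (current - new) / (new / df_resid)
        if f_stat < f_to_enter:  # assumed stopping rule, see lead-in
            break
        chosen.append(name)
        current = new
    return chosen

# Hypothetical data: y depends strongly on x1, weakly on x2, not at all on x3.
rng = random.Random(1)
n = 200
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [3.0 * a + 1.0 * b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]
selected = forward_select(data, y)
print(selected)
```

On data like these, x1 and x2 should enter first, in that order. Whether the pure-noise predictor x3 also slips in depends on the random draw, which already hints at the problems with these procedures.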

Personally, I do not advocate the use of these techniques, as these have many drawbacks:

They yield R-squared values that are badly biased to be high.
The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
These methods yield confidence intervals for effects and predicted values that are falsely narrow.
These have severe problems in the presence of collinearity.
These give biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; Tibshirani, 1996).
In many cases where we start at a different starting point, a stepwise selection may return a completely different model. These methods are far from stable.

Let us take an example to see why automated model selection may not be a good choice.

Imagine a high school track coach on the first day of try-outs. Thirty kids show up. These kids have some underlying level of intrinsic ability to which neither the coach nor anyone else has direct access. As a result, the coach does the only thing he can do, which is to have them all run a 100m dash. The times are presumably a measure of their intrinsic ability and are taken as such. However, they are probabilistic: some proportion of how well someone does is based on their actual ability, and some proportion is random. Imagine that the true situation is the following:

The results of the first race are displayed in the following figure along with the coach’s comments to the kids.

Note that partitioning the kids by their race times leaves overlaps on their intrinsic ability — this fact is crucial. After praising some, and yelling at some others (as coaches tend to do), he has them run again. Here are the results of the second race with the coach’s reactions (simulated from the same model above):

Notice that their intrinsic ability is identical, but the times bounced around relative to the first race. From the coach’s point of view, those he yelled at tended to improve, and those he praised tended to do worse. In fact, this is regression to the mean: a simple mathematical consequence of the fact that the coach is selecting athletes for the team based on a measurement that is partly random.
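The effect is easy to reproduce in a simulation. The sketch below is an illustration under assumed numbers, not the model behind the figures described here: each kid's true 100m time is drawn uniformly between 11 and 13 seconds, each race adds independent noise with a standard deviation of 0.5 seconds, and the coach praises the fastest third and yells at the slowest third. The try-out is repeated many times so that the average effect is visible.

```python
import random
import statistics

def tryout(rng, n_kids=30, noise_sd=0.5):
    """Simulate two races; return the mean change in time for each group."""
    true_time = [rng.uniform(11.0, 13.0) for _ in range(n_kids)]  # assumed ability
    race1 = [t + rng.gauss(0, noise_sd) for t in true_time]
    race2 = [t + rng.gauss(0, noise_sd) for t in true_time]  # same ability, new luck

    order = sorted(range(n_kids), key=lambda i: race1[i])
    third = n_kids // 3
    praised, yelled_at = order[:third], order[-third:]  # fastest / slowest in race 1

    def mean_change(group):
        return statistics.fmean(race2[i] - race1[i] for i in group)

    return mean_change(praised), mean_change(yelled_at)

rng = random.Random(59)
results = [tryout(rng) for _ in range(2000)]
praised_change = statistics.fmean(r[0] for r in results)
yelled_change = statistics.fmean(r[1] for r in results)

print("praised kids, average change in time:  ", round(praised_change, 3))
print("yelled-at kids, average change in time:", round(yelled_change, 3))
```

The praised group gets slower on average (positive change) and the yelled-at group gets faster (negative change), even though nothing in the simulation depends on what the coach said.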

Now, what does this have to do with automated (e.g., stepwise) model selection techniques?

Developing and confirming a model based on the same dataset is sometimes called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores (e.g., higher t-statistics), these are random variables and the realized values contain error. Thus, when we select variables based on having higher (or lower) realized values, they may be such because of their underlying true value, error, or both. If we proceed in this manner, we will be as surprised as the coach was after the second race. This is true whether we select variables based on having high t-statistics or low intercorrelations.
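The same thing can be shown for variable selection with an extreme, made-up case in which no predictor has any true relationship with the outcome. In each simulated study, the predictor with the highest R² is picked out of 50 pure-noise candidates and then checked on fresh data. The sample size, number of predictors, and number of repetitions are arbitrary choices for the illustration.

```python
import random
import statistics

def r_squared(x, y):
    """Squared simple correlation between x and y."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def one_study(rng, n=30, n_predictors=50):
    """Select the 'best' of many noise predictors, then re-test it on new data."""
    def noise():
        return [rng.gauss(0, 1) for _ in range(n)]

    y = noise()
    predictors = [noise() for _ in range(n_predictors)]
    best = max(predictors, key=lambda x: r_squared(x, y))
    selected_r2 = r_squared(best, y)

    # New data for the selected predictor: since it is unrelated to y, a fresh
    # sample is simply two new independent noise vectors.
    fresh_r2 = r_squared(noise(), noise())
    return selected_r2, fresh_r2

rng = random.Random(7)
studies = [one_study(rng) for _ in range(200)]
mean_selected = statistics.fmean(s[0] for s in studies)
mean_fresh = statistics.fmean(s[1] for s in studies)

print("average R-squared of the selected predictor:", round(mean_selected, 3))
print("average R-squared on fresh data:            ", round(mean_fresh, 3))
```

The selected predictor looks far better on the data used to select it than it does on new data, because it was chosen for its luck. This is the coach's second race again, and it is one reason the R-squared values reported after stepwise selection are biased high.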

Conclusion

Although many developments have taken place in automated model selection, with tools such as Libra and PyCaret, there remain many statistical and intuitive methods for selecting the best model.

Let’s explore!

Reference: statisticshowto.com
