Model Selection

Overview


Model selection in regression is the process of deciding which factors to use in a regression model.

Combinatorial Search


A simple way to find the relevant factors to regress against is a brute-force search over combinations of the factors: list all the possible combinations, run a regression on each in turn, and then apply a criterion (such as maximum r-squared) to the results to choose the best fit. With n candidate factors there are 2^n - 1 non-empty combinations, so this approach is only practical when the number of factors is small.

The combinatorics library can be used to create the combinations, and the regression library can then be applied to each one to generate the regression results.


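//data is assumed to be a data set that is already loaded, with a 'y' column and one column per factor
let items = ['factor1', 'factor2', 'factor3', 'factor4', 'factor5', 'factor6'];

//get all combinations of 3 of the factors
let combinations = cn.combinations(items, 3);

//regress y against each combination, keeping every result so the fits can be compared afterwards
let results = [];
for(let factors of combinations){
  let regress = olsregression.regressDataSet({
    data: data,
    y: 'y',
    x: factors
  });
  results.push({factors: factors, regress: regress});
}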
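
The loop above only covers combinations of exactly 3 factors. A full brute-force search would cover every non-empty subset of the factors. As a rough sketch of what that enumeration involves, the following plain JavaScript builds all subsets with a bitmask. The function name allSubsets is hypothetical and is not part of the combinatorics library.

```javascript
// Hypothetical helper (not part of the combinatorics library):
// returns every non-empty subset of `items`.
function allSubsets(items) {
  const subsets = [];
  // Each integer from 1 to 2^n - 1 acts as a bitmask that selects one subset.
  const total = 1 << items.length;
  for (let mask = 1; mask < total; mask++) {
    const subset = [];
    for (let i = 0; i < items.length; i++) {
      // Include items[i] when bit i of the mask is set.
      if (mask & (1 << i)) subset.push(items[i]);
    }
    subsets.push(subset);
  }
  return subsets;
}

const subsets = allSubsets(['factor1', 'factor2', 'factor3']);
console.log(subsets.length); // 7, that is 2^3 - 1
```

Each subset could then be passed as the x argument of the regression call in the same way as in the loop above. The count doubles with every added factor, which is why the search quickly becomes expensive.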
					


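The problem with this method of search is that for most measures of fit, such as r-squared, adding factors never makes the fit worse: a model with more factors will score at least as well as any smaller model contained within it, even when the extra factors are only noise. Choosing the model by such a measure can therefore lead to the model being overfit.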

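One way to combat an overfit model is to use one of the various information criteria, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), to judge the fit of the model. These criteria add a penalty for model complexity (that is, the number of factors), so an extra factor is favored only when it improves the fit by more than the penalty it incurs.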
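
As an illustration of how such a penalty works, here is a minimal sketch of two common criteria, AIC and BIC, computed from the residual sum of squares of a least-squares fit. It assumes normally distributed errors and drops constant terms that are the same for every model. The function names aic and bic and the candidate values are hypothetical, made up for the example, and do not come from the regression library.

```javascript
// Hypothetical helpers, not part of any regression library.
// n:   number of observations
// rss: residual sum of squares of the fitted model
// k:   number of estimated coefficients (here: factors plus an intercept)
function aic(n, rss, k) {
  return n * Math.log(rss / n) + 2 * k;
}

function bic(n, rss, k) {
  return n * Math.log(rss / n) + k * Math.log(n);
}

// Made-up fit results, for illustration only.
const n = 100;
const candidates = [
  { factors: ['factor1'], rss: 250 },
  { factors: ['factor1', 'factor2'], rss: 120 },
  { factors: ['factor1', 'factor2', 'factor3'], rss: 119.5 }
];

// Lower AIC is better: pick the candidate with the smallest value.
let best = null;
for (const c of candidates) {
  c.aic = aic(n, c.rss, c.factors.length + 1);
  if (best === null || c.aic < best.aic) best = c;
}
console.log(best.factors);
```

With these made-up numbers the third factor lowers the residual sum of squares only slightly, so its penalty outweighs the gain and the two-factor model is selected, whereas a raw measure of fit would have picked the three-factor model.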