Model Selection and Machine Learning

Overview

The machine learning process seeks a response function that accurately predicts the response to various data inputs. It starts with a set of possible response functions, then uses loss minimization to pick one response function from that set.

The ordinary least squares regression algorithm finds the values of the regression coefficients that minimize the sum of squared errors. That is, it chooses one function from a set of functions (the set of all linear functions of the independent variables). This process assumes that the independent variables have already been specified. The regression algorithm can be broadened to select the best set of independent variables as well as the best coefficients.
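As a minimal sketch of what "minimizing the squared error" means, the following fits a single independent variable using the closed-form least squares solution. The function name fitSimpleOls and the sample data are hypothetical and chosen only for illustration; they are not part of the regression library used later on this page.

```javascript
// Fit y = intercept + slope * x by minimizing the sum of squared errors.
// fitSimpleOls is a hypothetical helper, not a function of any library.
function fitSimpleOls(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  // Accumulate the covariance and variance terms of the closed-form solution.
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }

  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;

  // Sum of squared errors of the fitted line: the quantity being minimized.
  let sse = 0;
  for (let i = 0; i < n; i++) {
    sse += (ys[i] - (intercept + slope * xs[i])) ** 2;
  }
  return { slope, intercept, sse };
}

// Illustrative data only.
const fit = fitSimpleOls([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(fit);
```

Any other choice of slope and intercept would give a sum of squared errors at least as large as the one returned here.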

Combinatorial Search

A simple way to find the relevant factors to regress against is a brute force search over combinations of the factors. That is, list all the possible combinations of the factors, regress against each combination in turn, then apply some criterion across the regressions (such as maximum r-squared) to choose the best fit.

The combinatorics library can be used to create all the various combinations. Then the regression library can be applied to generate the regression results.


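// Candidate factors (column names in the data set)
const items = ['factor1', 'factor2', 'factor3', 'factor4', 'factor5', 'factor6'];

// Get all combinations of 3 of the factors
const combinations = cn.combinations(items, 3);

// Regress y against each combination and keep every result for comparison
const results = [];
for (const factors of combinations) {
  const regress = olsregression.regressDataSet({
    data: data,
    y: 'y',
    x: factors
  });
  results.push({ factors: factors, regress: regress });
}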
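
The loop above fixes the subset size at 3. A full brute force search would also vary the subset size. The following is a self-contained sketch of how the combinations could be enumerated without a library; the helper names kCombinations and allSubsets are hypothetical and are not the API of the combinatorics library used above.

```javascript
// Return every k-element combination of `items`, preserving input order.
// kCombinations is a hypothetical helper written for illustration.
function kCombinations(items, k) {
  if (k === 0) return [[]];
  if (k > items.length) return [];
  const [first, ...rest] = items;
  // Combinations that include the first item, then those that skip it.
  const withFirst = kCombinations(rest, k - 1).map((c) => [first, ...c]);
  const withoutFirst = kCombinations(rest, k);
  return withFirst.concat(withoutFirst);
}

// Return every non-empty subset of `items`, smallest subsets first.
function allSubsets(items) {
  const subsets = [];
  for (let k = 1; k <= items.length; k++) {
    subsets.push(...kCombinations(items, k));
  }
  return subsets;
}

const candidateFactors = ['factor1', 'factor2', 'factor3', 'factor4', 'factor5', 'factor6'];
console.log(kCombinations(candidateFactors, 3).length); // 20 subsets of size 3
console.log(allSubsets(candidateFactors).length);       // 63 non-empty subsets
```

The number of non-empty subsets is 2^p - 1 for p factors, so an exhaustive search is only practical when the list of candidate factors is short.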

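The problem with this method of search is that, for most measures of fit, more factors will always appear to fit at least as well as fewer. R-squared, for example, never decreases when a factor is added, so choosing by maximum r-squared favors the largest set of factors. This can lead to the model being overfit.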

One way to combat an overfit model is to use one of the various information criteria to judge the fit of the model. These criteria are designed to impose a penalty for model complexity (that is, the number of factors).
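
As a hedged sketch, two widely used criteria are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). For a least squares fit with normally distributed errors they can be written, up to an additive constant, in terms of the number of observations n, the sum of squared residuals, and the number of estimated coefficients k. The function names, the shape of the candidate objects, and the residual values below are assumptions made for illustration; they are not output of the regression library.

```javascript
// AIC and BIC for a least squares fit, up to an additive constant.
// n: number of observations, sse: sum of squared residuals,
// k: number of estimated coefficients (factors plus the intercept).
function aic(n, sse, k) {
  return n * Math.log(sse / n) + 2 * k;
}

function bic(n, sse, k) {
  return n * Math.log(sse / n) + k * Math.log(n);
}

// Pick the candidate with the lowest criterion value.
// Each candidate is assumed to look like { factors: [...], sse: number }.
function selectBest(candidates, n, criterion) {
  let best = null;
  for (const c of candidates) {
    const score = criterion(n, c.sse, c.factors.length + 1);
    if (best === null || score < best.score) {
      best = { factors: c.factors, sse: c.sse, score: score };
    }
  }
  return best;
}

// Made-up residual sums, for illustration only.
const candidates = [
  { factors: ['factor1'], sse: 120 },
  { factors: ['factor1', 'factor2'], sse: 80 },
  { factors: ['factor1', 'factor2', 'factor3'], sse: 79.5 }
];
console.log(selectBest(candidates, 50, aic).factors);
console.log(selectBest(candidates, 50, bic).factors);
```

In this made-up example the three-factor model has the smallest sum of squared residuals, but the improvement over the two-factor model is too small to offset the penalty, so both criteria prefer the two-factor model.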
