Model Selection and Machine Learning
Overview
The machine learning process seeks a response function that accurately predicts the response to various data inputs.
It starts with a set of possible response functions, then uses loss minimization to pick one
response function from the set.
The regression algorithm finds the values of the regression coefficients that minimize the squared error. That is, it chooses
a function from a set of functions (the set of all linear functions of the independent variables). This process
assumes that the independent variables have already been specified. The regression algorithm can be broadened
to select the best set of independent variables as well as to find the best coefficients.
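As a rough illustration of "choosing a function by minimizing squared error", here is a minimal sketch for the one-variable case, written in plain JavaScript without the regression library. The names `fitLine` and `squaredError` and the sample data are made up for this example.

```javascript
// Fit y = intercept + slope * x by ordinary least squares (closed-form solution).
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  return { slope, intercept };
}

// Sum of squared errors: the loss that the fit minimizes.
function squaredError(xs, ys, { slope, intercept }) {
  return xs.reduce((sum, x, i) => sum + (ys[i] - (intercept + slope * x)) ** 2, 0);
}

// Illustrative data only.
const xs = [1, 2, 3, 4, 5];
const ys = [2.1, 3.9, 6.2, 7.8, 10.1];

const best = fitLine(xs, ys);
// Any other line from the same set of linear functions has a larger loss.
const nudged = { slope: best.slope + 0.1, intercept: best.intercept };

console.log(best);
console.log(squaredError(xs, ys, best), squaredError(xs, ys, nudged));
```

The fitted line is the member of the set of linear functions with the smallest squared error. Moving the slope away from the fitted value, as `nudged` does, can only increase the loss.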
Combinatorial Search
A simple way to find the relevant factors to regress against is a brute-force search over combinations of the factors.
That is, list all the possible combinations of the factors, then regress on each in turn, then apply a criterion
(such as maximum r-squared) to choose the best fit among the regressions.
The combinatorics library can be used to create all the combinations. Then the regression library
can be applied to generate the regression results.
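let items = ['factor1', 'factor2', 'factor3', 'factor4', 'factor5', 'factor6'];
// get all combinations of 3 of the factors
let combinations = cn.combinations(items, 3);
// regress y against each combination in turn, keeping every result for comparison
// (data is assumed to be a data set that has already been loaded)
let results = [];
for (let factors of combinations) {
    let regress = olsregression.regressDataSet({
        data: data,
        y: 'y',
        x: factors
    });
    results.push({ factors: factors, regress: regress });
}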
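To get a feel for how many regressions a brute-force search runs, the following self-contained sketch enumerates the combinations directly. The `combinations` helper here is a stand-in written for illustration; it is not the combinatorics library's implementation.

```javascript
// Return every k-element combination of `items`, preserving input order.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [first, ...rest] = items;
  // Combinations that include the first item, then those that skip it.
  const withFirst = combinations(rest, k - 1).map((c) => [first, ...c]);
  const withoutFirst = combinations(rest, k);
  return [...withFirst, ...withoutFirst];
}

const items = ['factor1', 'factor2', 'factor3', 'factor4', 'factor5', 'factor6'];

// Number of 3-factor models: "6 choose 3".
const triples = combinations(items, 3);
console.log(triples.length); // 20

// Number of models if every non-empty subset of factors is tried: 2^6 - 1.
let total = 0;
for (let k = 1; k <= items.length; k++) {
  total += combinations(items, k).length;
}
console.log(total); // 63
```

Each additional candidate factor doubles the number of subsets, so an exhaustive search is only practical for a small number of factors.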
The problem with this method of search is that for most measures of fit, such as r-squared, adding
factors never makes the fit worse, so the model with the most factors always looks best. This can lead to the model
being overfit.
One way to combat an overfit model is to use one of the various Information Criteria
measures to judge the fit of the model. These criteria are designed to penalize model complexity
(that is, the number of factors).
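As a sketch of how such a penalty works, the example below computes two common criteria, AIC and BIC, in the form usually used for least-squares fits with Gaussian errors (constant terms dropped). The function names, the sample size, and the residual sums of squares are invented for illustration and do not come from the regression library.

```javascript
// AIC and BIC for a least-squares fit, up to an additive constant.
// n: number of observations, k: number of fitted parameters,
// rss: residual sum of squares. Lower values are better.
function aic(n, k, rss) {
  return n * Math.log(rss / n) + 2 * k;
}

function bic(n, k, rss) {
  return n * Math.log(rss / n) + k * Math.log(n);
}

// Made-up candidate fits on the same 50 observations.
const n = 50;
const candidates = [
  { factors: ['factor1'], rss: 120.0 },
  { factors: ['factor1', 'factor2'], rss: 80.0 },
  { factors: ['factor1', 'factor2', 'factor3'], rss: 79.5 },
];

for (const c of candidates) {
  const k = c.factors.length + 1; // one coefficient per factor, plus the intercept
  c.aic = aic(n, k, c.rss);
  c.bic = bic(n, k, c.rss);
}

// Pick the candidate with the lowest score under each criterion.
const bestByAic = candidates.reduce((a, b) => (b.aic < a.aic ? b : a));
const bestByBic = candidates.reduce((a, b) => (b.bic < a.bic ? b : a));

console.log(bestByAic.factors, bestByBic.factors);
```

In this made-up data the third factor lowers the residual sum of squares only slightly, so its fit is marginally better, but the complexity penalty outweighs the gain and both criteria prefer the two-factor model.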
Ridge Regression