data partitioning

Data Partitioning

Overview

When training a machine learning algorithm, one usually starts with a sample data set. However,

Splitting a Dataset

The typical machine learning process splits the available dataset into two datasets,

Training Set - is used to train the model
Test Set - is used to estimate the model error on out of sample data.

Validation Set

Occasional, there are model parameters that are not trained during model training. A typical example of this is the lambda parameter in ridge regression. Lambda must be chosen prior to training. However, there is typically not a good reason to choose one lambda over another. It would be ideal if the lambda parameter (suusally referred to as a hyper-parameter) itself could be trained.

The solution to this is to partition off the test set, another set called the validation set. Then the model can be trained with different values of the hyper-parameters, and the model that performs best on the validation set is chosen. (that is the hyper-parameters that created the model that performed the best on the validation set.)

Implementation

sampling

randomize will shuffle the contents in an array and return a new array.


let rs = await import('/lib/statistics/sampling/v1.0.0/sampling.mjs')
let data = [1,2,3,4,5,6,7,8,9];
let split = rs.split(data, 3);

Try it!

Overview

Splitting a Dataset

Validation Set

Implementation

Contents