Ensemble Learning
Overview
Ensemble learning refers to a machine learning model that is obtained by training several separate machine
learning algorithms, and then using a combining algorithm to produce a single answer from the model ensemble.
An important criterion in selecting the models to use in an ensemble is to make sure that the ensemble represents
a set of diverse models; that is, you want models that don't always agree. In particular, it is good to have models
that have different sets of strengths. One model may misclassify a set of points, while another model may be good
at classifying those points. In having a diverse set, you are hoping to have a set of models such that for any
point in the dataset, one or more models can classify it correctly.
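One way to make the diversity idea concrete is to measure how often a pair of trained models disagree. The following is a minimal sketch, assuming scikit-learn and a synthetic dataset; the particular model families chosen here are illustrative, not a prescription.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data and three different model families.
X, y = make_classification(n_samples=500, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
preds = {name: m.fit(X, y).predict(X) for name, m in models.items()}

# Pairwise disagreement rate: the fraction of points where two models differ.
names = list(preds)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        rate = np.mean(preds[names[i]] != preds[names[j]])
        print(f"{names[i]} vs {names[j]}: {rate:.2%} disagreement")

A disagreement rate of zero would mean the second model adds nothing to the ensemble; some disagreement is what leaves room for the combined vote to beat any single member.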
ensemble modeling resources
Voting or Averaging
A simple way to build a model from a collection of other models is either to average the results of the individual
models, or to give each model a vote on the outcome and then tally the votes.
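A small sketch of both combining rules, using NumPy; the arrays below stand in for the predictions of already-trained models.

import numpy as np

# Hard voting: each model casts one vote per point; take the majority label.
votes = np.array([
    [0, 1, 1, 0],   # predictions from model A
    [0, 1, 0, 0],   # predictions from model B
    [1, 1, 1, 0],   # predictions from model C
])
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)  # [0 1 1 0]

# Averaging: for regression (or predicted probabilities), take the mean.
estimates = np.array([
    [2.1, 3.9, 5.2],
    [1.9, 4.1, 4.8],
    [2.0, 4.0, 5.0],
])
print(estimates.mean(axis=0))  # [2. 4. 5.]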
Resampling (Bagging)
One common procedure to create a robust model from a fixed dataset is to create multiple datasets by resampling
(with replacement) from the original dataset, and then training a model on each resampled set; this is known as
bagging (bootstrap aggregating). Typically, this is done using the same type of model each time. As an example,
one may train a decision tree on each resampled set and then average the results; the resulting ensemble is known
as a forest.
Code for resampling from a dataset can be found on the
resampling page.
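Here is a minimal bagging sketch, assuming scikit-learn: draw bootstrap samples (resampling with replacement), fit a decision tree on each, and take a majority vote across the trees at prediction time. The dataset and ensemble size are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Resample the dataset: n draws with replacement from n points.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Average the ensemble: majority vote across the trees (a "forest").
all_preds = np.array([t.predict(X) for t in trees])
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
print("ensemble training accuracy:", (forest_pred == y).mean())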
Boosting
An improvement on the bagging algorithm is the boosting algorithm. Boosting is similar to bagging except that when
it resamples the dataset, it focuses on creating a dataset that emphasizes the datapoints where the currently
trained models are weak, that is, where they misclassify the data.
The
AdaBoost algorithm is the standard boosting algorithm.
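The following is a sketch of the AdaBoost reweighting loop for binary labels in {-1, +1}, using decision stumps as the weak learners (an illustrative choice); rather than literally resampling, this variant reweights each point so the next round's learner focuses on the current mistakes.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                      # relabel the classes as -1 / +1
w = np.full(len(X), 1 / len(X))      # start with uniform point weights

stumps, alphas = [], []
for _ in range(25):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()         # weighted error of this round's stump
    alpha = 0.5 * np.log((1 - err) / err)
    # Upweight the points this stump got wrong so the next round
    # emphasizes them; then renormalize the weights.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: a weighted vote of all the stumps.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("boosted training accuracy:", (np.sign(score) == y).mean())

In practice, scikit-learn ships a ready-made implementation as sklearn.ensemble.AdaBoostClassifier, so the loop above is only meant to show the mechanics.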