Ensemble Learning

Overview


Ensemble learning refers to a machine learning model that is obtained by training several separate machine learning models, and then using an algorithm to combine their outputs into a single answer.

An important criterion in selecting the models to use in an ensemble is to make sure that the ensemble represents a set of diverse models, that is, you want models that don't always agree. In particular, it is good to have models with different sets of strengths. One model may misclassify a set of points, but another model may be good at classifying those points. In having a diverse set, you are hoping to have a set of models such that for any point in the dataset, one or more models can classify it correctly.

Further material can be found on the ensemble modeling resources page.

Voting or Averaging


A simple way to build a model from a collection of other models is either to average the results of the individual models, or to give each model a vote on the outcome and then tally the votes.
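As a minimal sketch, here is a majority-vote ensemble over a few diverse classifiers. It assumes scikit-learn and NumPy are available; the particular base models and the synthetic dataset are illustrative choices, not prescriptions.

    # Majority-vote ensemble over a few diverse classifiers.
    # Assumes binary 0/1 labels and an odd number of models,
    # so rounding the mean vote gives the majority class.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    models = [LogisticRegression(max_iter=1000),
              KNeighborsClassifier(),
              DecisionTreeClassifier(random_state=0)]
    for m in models:
        m.fit(X, y)

    # Each model casts one vote per point; the majority wins.
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    majority = np.round(votes.mean(axis=0)).astype(int)

For regression the same idea applies with averaging in place of voting: replace the vote tally with the plain mean of the models' predictions.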

Resampling (Bagging)


One common procedure for building robust models from a fixed dataset is to create multiple datasets by resampling from the original with replacement, and then training a model on each resampled set (this is known as bagging). Typically, this is done using the same type of model each time. As an example, one may train a decision tree on each resampled set and then average their results; the resulting ensemble of trees is known as a forest.
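The resampling loop itself is short. The sketch below assumes scikit-learn's DecisionTreeClassifier as the base model, NumPy arrays for X and y, and NumPy's default generator for the bootstrap draws; the function names are just for illustration.

    # Bagging: train one tree per bootstrap resample, then average the votes.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_trees(X, y, n_trees=25, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_trees):
            # Draw n points with replacement: a bootstrap sample.
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier()
            tree.fit(X[idx], y[idx])
            trees.append(tree)
        return trees

    def bagged_predict(trees, X):
        # Average the trees' votes; round to the majority class (0/1 labels).
        votes = np.stack([t.predict(X) for t in trees])
        return np.round(votes.mean(axis=0)).astype(int)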

Code for resampling from a dataset can be found on the resampling page.

Boosting


An improvement on the bagging algorithm is the boosting algorithm. Boosting is similar to bagging, except that when it resamples (or reweights) the dataset, it emphasizes the datapoints on which the currently trained models are weak, that is, the points they misclassify.

The AdaBoost algorithm is the standard boosting algorithm.
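A compact sketch of the idea follows, assuming scikit-learn, NumPy, and -1/+1 labels; the base learner is a depth-1 decision tree (a stump), and the reweighting follows the classic AdaBoost update.

    # AdaBoost: reweight the data each round so the next stump focuses
    # on the points the current ensemble gets wrong. Labels must be -1/+1.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=50):
        n = len(X)
        w = np.full(n, 1.0 / n)               # start with uniform weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()          # weighted error of this stump
            if err >= 0.5:                    # no better than chance; stop
                break
            alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
            w *= np.exp(-alpha * y * pred)    # upweight the misclassified points
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def boosted_predict(stumps, alphas, X):
        # Weighted vote of the stumps; the sign gives the -1/+1 label.
        scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(scores)

Each round, correctly classified points have their weights shrunk and misclassified points have theirs grown, so later stumps concentrate on exactly the examples the ensemble currently handles worst.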
