Machine Learning Sampling

Overview


Sampling during the training process refers to using a sampling algorithm or strategy to build the best training set for a machine learning model. That is, it represents a slight modification to the standard loss minimization routine.

Uncertainty Sampling


Uncertainty sampling uses a measure of uncertainty to identify the data records where the current model is least confident, and then either trains the model on those records or overweights them in the next round of training.
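
The selection step can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes the uncertainty measure is the Shannon entropy of the model's predicted class probabilities (one common choice among several), and the names entropy, most_uncertain, and pred_probs are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pred_probs, k):
    """Return the indices of the k records with the highest predictive entropy."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: one probability vector per record.
pred_probs = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.80, 0.20],
    [0.50, 0.50],  # model is maximally unsure
]

# Indices of the two records to train on (or overweight) next round.
selected = most_uncertain(pred_probs, k=2)
```

The records at the selected indices would then be added to the next training round or given a larger weight in the loss.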

Diversity Sampling


Diversity sampling seeks to build a dataset that is diverse across dimensions, including both the features of each record and the target classifications. In many datasets, certain demographics are overrepresented in the natural population, so they appear more often in training sets. This may cause the machine learning algorithm to learn those features well at the expense of the less common features.

Diversity sampling seeks to overweight the less common features in order to encourage the machine learning model to learn features and categories equally.
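
One simple way to overweight less common groups is inverse-frequency weighting. The sketch below is an assumption about how the weighting might be done, not a prescribed method: each record is weighted so that every group contributes the same total weight. The function name inverse_frequency_weights and the example groups are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Weight each record so that every group has the same total weight.

    The weights are scaled so that they sum to the number of records.
    """
    counts = Counter(groups)
    n_records = len(groups)
    n_groups = len(counts)
    return [n_records / (n_groups * counts[g]) for g in groups]


# Hypothetical group label per record: "a" is common, "b" is rare.
groups = ["a", "a", "a", "b"]
weights = inverse_frequency_weights(groups)

# Total weight per group, to check that the groups are now balanced.
total_a = sum(w for g, w in zip(groups, weights) if g == "a")
total_b = sum(w for g, w in zip(groups, weights) if g == "b")
```

The resulting weights could be used as per-record loss weights, or passed as the weights argument of random.choices to draw a resampled training set.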

  • Cluster Sampling - a form of diversity sampling that uses a clustering algorithm to partition the data into clusters, and then samples equally from each cluster.
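
The equal-per-cluster sampling step might look like the following sketch. It assumes the cluster labels have already been produced by some clustering algorithm (for example, k-means), so the clustering itself is not shown. The function name sample_equally_per_cluster, the per_cluster and seed parameters, and the example labels are all hypothetical.

```python
import random
from collections import defaultdict


def sample_equally_per_cluster(cluster_labels, per_cluster, seed=0):
    """Return record indices, drawing up to per_cluster records from each cluster."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible

    # Group record indices by their cluster label.
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    chosen = []
    for label in sorted(members):
        pool = members[label]
        # A cluster smaller than per_cluster contributes all of its records.
        chosen.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return chosen


# Hypothetical labels: cluster 0 is large, clusters 1 and 2 are small.
cluster_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2]
sample = sample_equally_per_cluster(cluster_labels, per_cluster=2)
```

Here the large cluster contributes the same number of records as each small one, so the sample is balanced across clusters even though the original data is not.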