Stratified Sampling

Overview

In stratified sampling, the population is first segmented into non-overlapping groups called strata, and a random sample is then drawn independently from each stratum.
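The two steps described above (segment, then sample each group independently) can be sketched in plain JavaScript. This is a minimal illustration, not the site's own implementation: the function names, the `region` field, and the voter records are all hypothetical, and a simple random sample without replacement is assumed within each stratum.

```javascript
// Draw k items from `items` without replacement using a partial
// Fisher-Yates shuffle. `rng` can be replaced for reproducible runs.
function sampleWithoutReplacement(items, k, rng = Math.random) {
  if (k > items.length) {
    throw new RangeError("sample size exceeds stratum size");
  }
  const pool = items.slice(); // copy so the caller's array is untouched
  for (let i = 0; i < k; i++) {
    // Pick a random index in [i, pool.length) and swap it into position i.
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}

// Segment the population into strata, then sample each stratum.
// `keyOf` maps a unit to its stratum label; `sizes` maps label -> n_h.
function stratifiedSample(population, keyOf, sizes, rng = Math.random) {
  const strata = new Map();
  for (const unit of population) {
    const key = keyOf(unit);
    if (!strata.has(key)) strata.set(key, []);
    strata.get(key).push(unit);
  }
  const result = {};
  for (const [key, units] of strata) {
    result[key] = sampleWithoutReplacement(units, sizes[key] ?? 0, rng);
  }
  return result;
}

// Hypothetical population: 6 urban and 4 rural voters.
const voters = [
  ...Array.from({ length: 6 }, (_, i) => ({ id: `u${i}`, region: "urban" })),
  ...Array.from({ length: 4 }, (_, i) => ({ id: `r${i}`, region: "rural" })),
];
const drawn = stratifiedSample(voters, (v) => v.region, { urban: 3, rural: 2 });
```

Here `drawn.urban` holds three urban voters and `drawn.rural` holds two rural voters, so both groups are guaranteed to appear in the sample regardless of how the random draws fall.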

Stratified sampling is useful when the analyst wants to gather information about the individual strata as well as the population as a whole. In addition, stratified sampling may help to combat selection bias by ensuring that each stratum is well represented in the final sample.

If certain strata are known to be more variable than others, the analyst can design the sample to draw more observations from the more variable strata.
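One standard way to put this idea into numbers is Neyman allocation, which assigns each stratum a share of the total sample proportional to its size times its standard deviation. The text does not name a specific rule, so the choice of Neyman allocation is an assumption here, as are the function name, the field names, and the figures in the example.

```javascript
// Neyman allocation: n_h = n * (N_h * S_h) / sum_k (N_k * S_k),
// where N_h is the stratum size and S_h its standard deviation.
// Returns fractional allocations; rounding is left to the caller.
function neymanAllocation(strata, n) {
  const weights = strata.map((s) => s.size * s.sd);
  const total = weights.reduce((a, b) => a + b, 0);
  if (total === 0) {
    throw new RangeError("all strata have zero variability");
  }
  return strata.map((s, i) => ({ name: s.name, n: (n * weights[i]) / total }));
}

// Hypothetical strata of equal size, one three times as variable as the other.
const allocation = neymanAllocation(
  [
    { name: "stable", size: 500, sd: 10 },
    { name: "volatile", size: 500, sd: 30 },
  ],
  100
);
// allocation -> stable: 25, volatile: 75
```

With equal stratum sizes, the stratum with three times the standard deviation receives three times as many observations. In practice the standard deviations are usually estimates from earlier surveys, so the allocation is only as good as those estimates.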

Sampling

Let {% N_1, N_2,...,N_L %} be the population sizes of the {% L %} strata, and let {% n_1,n_2,...,n_L %} be the number of units sampled from each. The number of possible samples that can be drawn from stratum {% i %} is

{% \binom{N_i}{n_i} %}

that is, the number of combinations of {% N_i %} things taken {% n_i %} at a time.

Because each stratum is sampled independently, the total number of possible stratified samples across all strata is the product

{% \binom{N_1}{n_1} \times \binom{N_2}{n_2} \times ... \times \binom{N_L}{n_L} %}

(see combinatorics)
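The counting formula can be checked numerically. The sketch below uses JavaScript's built-in `BigInt` because these counts grow quickly past the range of ordinary numbers; the function names and the example sizes are hypothetical.

```javascript
// Binomial coefficient C(N, n) computed exactly with BigInt.
// After step i the running value equals C(N - k + i, i), so every
// integer division is exact.
function binomial(N, n) {
  if (n < 0 || n > N) return 0n;
  const k = BigInt(Math.min(n, N - n)); // use symmetry to shorten the loop
  const bigN = BigInt(N);
  let result = 1n;
  for (let i = 1n; i <= k; i++) {
    result = (result * (bigN - k + i)) / i;
  }
  return result;
}

// Number of possible stratified samples: the product of C(N_i, n_i).
function countStratifiedSamples(populationSizes, sampleSizes) {
  if (populationSizes.length !== sampleSizes.length) {
    throw new RangeError("need one sample size per stratum");
  }
  return populationSizes.reduce(
    (product, N, i) => product * binomial(N, sampleSizes[i]),
    1n
  );
}

// Two strata of sizes 6 and 4, sampling 3 and 2 units respectively.
const stratifiedCount = countStratifiedSamples([6, 4], [3, 2]); // 20n * 6n = 120n
const unrestrictedCount = binomial(10, 5); // 252n
```

For comparison, an unrestricted sample of 5 units from the same 10 can be drawn in 252 ways, so stratification rules out the 132 samples that would over- or under-represent one of the strata.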

Computing Statistics

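To estimate the population mean, compute a weighted average of the stratum sample means {% \bar{X}_h %}, where each stratum's weight {% W_h = N_h / N %} is its share of the total population {% N = \sum N_h %}.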

{% \bar{X} = \frac{\sum N_h \bar{X}_h}{N} = \sum W_h \bar{X}_h %}


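// data: one entry per stratum, with total (N_h) and sample (the sampled units)
// N: total population across all strata
let total = $list(data).map(stratum => stratum.total).sum();
// Weighted average: each stratum's sample mean times its weight W_h = N_h / N
let average = $list(data)
	.map(stratum => (stratum.total / total) *
		$list(stratum.sample).map(unit => unit.value).average())
	.sum();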

Aggregate of Polls

The stratified random sampling methodology can also be used to compute an aggregate from a set of independent polls. Each poll is treated like a stratum: the averages from the individual polls are combined as above, using a weight for each poll.

In the simplest case, each poll is weighted by the sample size of the poll. If the analyst believes that some polls are more reliable than others, she can adjust the weights to reflect this belief.
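A minimal sketch of this weighting in plain JavaScript follows. The `reliability` multiplier is one hypothetical way to encode the analyst's belief about a poll; the text does not prescribe a scheme, and the poll figures are invented for illustration.

```javascript
// Aggregate poll means using weights proportional to sample size.
// Each poll has `mean` (the poll's estimate) and `n` (its sample size).
// `reliability` is an optional, analyst-chosen multiplier (default 1).
function aggregatePolls(polls) {
  const weights = polls.map((p) => p.n * (p.reliability ?? 1));
  const totalWeight = weights.reduce((a, b) => a + b, 0);
  if (totalWeight <= 0) {
    throw new RangeError("weights must sum to a positive value");
  }
  return polls.reduce(
    (sum, p, i) => sum + (weights[i] / totalWeight) * p.mean,
    0
  );
}

// Two hypothetical polls of equal size.
const bySampleSize = aggregatePolls([
  { mean: 0.5, n: 1000 },
  { mean: 0.4, n: 1000 },
]);

// The same polls, with the second one trusted half as much.
const adjusted = aggregatePolls([
  { mean: 0.5, n: 1000 },
  { mean: 0.4, n: 1000, reliability: 0.5 },
]);
```

With equal sample sizes the aggregate sits midway between the two polls (about 0.45). Halving the second poll's weight pulls the aggregate toward the first poll (about 0.467).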