Clustering Sum of Squared Errors

Overview


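The sum of squared errors (SSE) of a set of clusters is defined as
{% SSE = \sum_k \sum_{i \in k} (x_i - \mu_k)^2 %}
where
  • {% x_i %} is a data point assigned to cluster k
  • {% \mu_k %} is the centroid (mean) of the kth cluster

For multidimensional data, the squared term is the squared Euclidean distance between the point and its centroid.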
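
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sse`, the `points` and `labels` arguments, and the four sample points are all assumptions made up for this example.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Add the squared Euclidean distance of every member to the centroid.
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

print(sse(points, labels))        # 4.0
print(sse(points, [0, 0, 0, 0]))  # 104.0
```

With the sample data, putting all four points in one cluster gives an SSE of 104, while splitting them into the two obvious pairs gives 4.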

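Given this definition, the sum of squared errors must go down as the number of clusters increases, reaching zero when every data point is its own cluster. However, a lower SSE on its own does not mean that a larger number of clusters is a better choice.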

Elbow


As the number of clusters increases, the SSE decreases, but typically by a smaller amount for each additional cluster. Sometimes a kink, known as an "elbow", appears in the SSE curve: a point after which the rate of decline drops markedly. Many analysts choose the number of clusters at the elbow as the optimal number.

Multiple elbows may appear in the chart, in which case choosing among them requires some judgement.
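
One rough way to locate an elbow numerically is to look for the number of clusters where the drop in SSE shrinks the most. The sketch below is only one possible heuristic, not a method the text prescribes. The function name `elbow_k` is hypothetical, the SSE values are made up for illustration, and the code assumes the SSE has already been computed for consecutive cluster counts.

```python
def elbow_k(sse_by_k):
    """Return the k where the SSE curve bends most sharply.

    The bend at k is (drop in SSE going into k) minus (drop in SSE
    leaving k). Returns None if fewer than three k values are given.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    # Slide over consecutive triples (k - 1, k, k + 1).
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]
        drop_after = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up SSE values for k = 1..6, for illustration only.
sse_by_k = {1: 1000.0, 2: 600.0, 3: 250.0, 4: 230.0, 5: 215.0, 6: 205.0}

print(elbow_k(sse_by_k))  # 3
```

The heuristic reports only the single sharpest bend. When the curve has several elbows, inspecting the bend value at each k, or the chart itself, is still needed.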
