Clustering Sum of Squared Errors
Overview
The sum of squared errors (SSE) of a set of clusters is defined to be
{% SSE = \sum_k \sum_{i \in k} (x_i - \mu_k)^2 %}
where
- {% x_i %} is a data point assigned to cluster {% k %}
- {% \mu_k %} is the centroid (mean) of the kth cluster
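The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sse`, the sample `points`, and the `labels` list are all assumptions made for the example, and the squared term is taken to be the squared Euclidean distance for multi-dimensional points.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    # Group the points by their cluster label.
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid mu_k: coordinate-wise mean of the cluster's points.
        mu = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Add the squared Euclidean distance of every member to the centroid.
        total += sum((p[d] - mu[d]) ** 2 for p in members for d in range(dim))
    return total


# Hypothetical data: two small clusters in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]

print(sse(points, labels))        # two clusters
print(sse(points, [0, 0, 0, 0]))  # same points forced into one cluster
```

Forcing the same points into a single cluster gives a much larger value, since every point is then measured against one shared centroid.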
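By this definition, adding clusters cannot increase the minimum achievable SSE, and it reaches zero when every data point is its own cluster. However, this means the SSE alone cannot be used to choose the number of clusters.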
Elbow
As the number of clusters increases, the SSE decreases, but typically by a smaller amount with each additional cluster. Sometimes a kink
appears in the SSE curve, known as an "elbow", where the rate of decline of the SSE drops markedly. Many
analysts choose the number of clusters at the elbow as the optimal number.
Multiple elbows may appear in the chart, which then requires a bit of judgement as to which one to use.
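To make the elbow concrete, the sketch below computes an SSE curve for a small one-dimensional data set. Everything here is an assumption for illustration: the toy `data`, the helper `kmeans_sse`, and the use of a basic Lloyd's k-means with farthest-point initialisation, which finds a local rather than a guaranteed global minimum of the SSE.

```python
def kmeans_sse(points, k, iters=50):
    """Run a basic Lloyd's k-means on 1-D data and return the resulting SSE."""
    # Farthest-point initialisation: a deterministic choice made for this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min((p - c) ** 2 for c in centroids))
        centroids.append(farthest)

    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            groups[nearest].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]

    # SSE: squared distance from each point to its nearest centroid.
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


# Hypothetical toy data: three well-separated groups on a line.
data = [1.0, 1.5, 0.5, 5.0, 5.5, 4.5, 9.0, 9.5, 8.5]

sse_by_k = {k: kmeans_sse(data, k) for k in range(1, 7)}
for k in range(1, 7):
    # Decrease in SSE gained by adding the k-th cluster (None for k = 1).
    drop = sse_by_k[k - 1] - sse_by_k[k] if k > 1 else None
    print(k, sse_by_k[k], drop)
```

On this toy data the drops in SSE are large up to three clusters and small afterwards, which is the kind of kink described above. On real data the change is often less clear-cut, so the printed drops are best read as an aid to judgement rather than an automatic rule.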