K-Means Clustering
Overview
K-Means clustering is an algorithm for finding a set of k clusters in a dataset of n points. Each cluster is represented by a centroid, that is, the point that is the mean of the points assigned to the cluster. Writing S_i for the set of points assigned to cluster i and n_i for its size, the i-th centroid is
{% \vec{\mu}_i = \frac{1}{n_i} \sum_{\vec{x}_j \in S_i} \vec{x}_j %}
The algorithm attempts to find k centroids such that the squared error is minimized.
{% error = \sum_{i=1}^k \sum_{\vec{x}_j \in S_i} || \vec{x}_j - \vec{\mu}_i ||^2 %}
The algorithm is not guaranteed to find the global minimum of the error, but it will converge to a local minimum. This means that it may
make sense to run the algorithm multiple times and then choose the result with the lowest error.
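As an illustration of the objective above, here is a minimal Python sketch (the function and variable names are hypothetical, not part of the library) that computes the squared error for a given assignment of points to centroids:

```python
import numpy as np

def squared_error(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid.

    points:    (n, d) array of data points
    centroids: (k, d) array of centroids
    labels:    (n,) array giving the centroid index assigned to each point
    """
    diffs = points - centroids[labels]   # offset of each point from its centroid
    return float(np.sum(diffs ** 2))     # total squared Euclidean distance
```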
Algorithm
The algorithm begins by randomly generating k points, taken to be an initial guess of the k centroids.
Next it iterates the following steps.
- Assign every data point to its closest centroid.
- Update the centroids - once the points have been assigned, recompute each centroid as the mean of the points assigned to it.
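The following Python sketch shows one way these two steps could be implemented. It is a minimal illustration, assuming NumPy and initializing the centroids by sampling k points from the dataset; it is not the library's actual implementation.

```python
import numpy as np

def kmeans(points, k, iterations=100, rng=None):
    """Minimal k-means sketch: random initial centroids, then repeated
    assignment and update steps."""
    rng = np.random.default_rng() if rng is None else rng
    # Initial guess: k randomly chosen points from the dataset
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:          # keep the old centroid if its cluster is empty
                centroids[i] = members.mean(axis=0)
    return centroids, labels
```

Because the iteration only converges to a local minimum, a practical run would repeat this from several random initializations and keep the best result, as noted in the Overview.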
Choosing the Number of Clusters
The k-means algorithm requires the modeler to choose the number of clusters before running the algorithm. There are no clear
and exact rules for finding the optimal number of clusters. The following methods can help guide the choice.
Examples
The k-means clustering examples work by first generating k random points and then running several iterations that refine the k centroids.
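As a hedged usage sketch of how such an example might look (it reuses the hypothetical kmeans and squared_error helpers from the sketches above, not the library's API), the following generates random data, runs a few restarts, and keeps the result with the lowest error:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))         # toy two-dimensional dataset

best = None
for _ in range(5):                       # several random restarts
    centroids, labels = kmeans(data, k=3, rng=rng)
    err = squared_error(data, centroids, labels)
    if best is None or err < best[0]:    # keep the lowest-error result
        best = (err, centroids, labels)

best_err, best_centroids, best_labels = best
print("best squared error:", best_err)
```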
K-Means Clustering Library
K-Means Clustering
- implements the k-means clustering algorithm.
Video Demos