K-Means Clustering
Overview
K-Means clustering is an algorithm for finding a set of k clusters in a dataset of n points. Each cluster is represented by a centroid, that is, the point that is the mean of the points assigned to the cluster. Writing S_i for the set of points assigned to cluster i and n_i for its size, the i-th centroid is
{% \vec{\mu}_i = \frac{1}{n_i} \sum_{\vec{x}_j \in S_i} \vec{x}_j %}
The algorithm attempts to find k centroids such that the squared error is minimized.
{% error = \sum_{i=1}^k \sum_{\vec{x}_j \in S_i} || \vec{x}_j - \vec{\mu}_i ||^2 %}
The algorithm is not guaranteed to find the global minimum of the error, but it will converge to a local minimum. This means that it may
make sense to run the algorithm multiple times and then choose the result with the lowest error.
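As an illustration of the objective above, here is a minimal Python sketch (the function and variable names are hypothetical, not part of the library) that computes the squared error for a given assignment of points to centroids:

```python
import numpy as np

def squared_error(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid.

    points:    (n, d) array of data points
    centroids: (k, d) array of centroids
    labels:    (n,) array giving the centroid index assigned to each point
    """
    diffs = points - centroids[labels]   # offset of each point from its centroid
    return float(np.sum(diffs ** 2))     # total squared Euclidean distance
```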
Algorithm
The algorithm begins by randomly generating k points, taken to be an initial guess of the k centroids.
Next it iterates the following steps.
- Assign every data point to its closest centroid.
- Update the centroids - once the points have been assigned, recompute each centroid as the mean of the points assigned to it.
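The following Python sketch shows one way these two steps could be implemented. It is a minimal illustration, assuming NumPy and initializing the centroids by sampling k points from the dataset; it is not the library's actual implementation.

```python
import numpy as np

def kmeans(points, k, iterations=100, rng=None):
    """Minimal k-means sketch: random initial centroids, then repeated
    assignment and update steps."""
    rng = np.random.default_rng() if rng is None else rng
    # Initial guess: k randomly chosen points from the dataset
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:          # keep the old centroid if its cluster is empty
                centroids[i] = members.mean(axis=0)
    return centroids, labels
```

Because the iteration only converges to a local minimum, a practical run would repeat this from several random initializations and keep the best result, as noted in the Overview.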
Choosing the Number of Clusters
The k-means algorithm requires the modeler to choose the number of clusters before running the algorithm. There are no clear
and exact rules for finding the optimal number of clusters. The following methods can help guide the choice.
Examples
The k-means clustering examples work by first generating k random points and then running several iterations that refine the k centroids.
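As a hedged usage sketch of how such an example might look (it reuses the hypothetical kmeans and squared_error helpers from the sketches above, not the library's API), the following generates random data, runs a few restarts, and keeps the result with the lowest error:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))         # toy two-dimensional dataset

best = None
for _ in range(5):                       # several random restarts
    centroids, labels = kmeans(data, k=3, rng=rng)
    err = squared_error(data, centroids, labels)
    if best is None or err < best[0]:    # keep the lowest-error result
        best = (err, centroids, labels)

best_err, best_centroids, best_labels = best
print("best squared error:", best_err)
```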
K-Means Clustering Library
K-Means Clustering
- implements the k-means clustering algorithm.
Video Demos