What Is k-means Clustering?

Data mining with the k-means algorithm

Three business people reviewing data graphs on a large screen
 Monty Rakusen/Getty Images

The k-means clustering algorithm is a data mining and machine learning tool used to cluster observations into groups of related observations without any prior knowledge of those relationships. By sampling, the algorithm attempts to show in which category, or cluster, the data belong to, with the number of clusters being defined by the value k.

The k-means algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging, biometrics, and related fields. The advantage of k-means clustering is that it tells about your data (using its unsupervised form) rather than you having to instruct the algorithm about the data at the start (using the supervised form of the algorithm).

It is sometimes referred to as Lloyd's Algorithm, particularly in computer science circles because the standard algorithm was first proposed by Stuart Lloyd in 1957. The term "k-means" was coined in 1967 by James McQueen.

How the k-means Algorithm Functions

The k-means algorithm is an evolutionary algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to clusters based upon the observation’s proximity to the mean of the cluster. The cluster’s mean is then recomputed and the process begins again. Here’s how the algorithm works:

  1. The algorithm arbitrarily selects k points as the initial cluster centers (the means).
  2. Each point in the dataset is assigned to the closed cluster, based upon the Euclidean distance between each point and each cluster center.
  3. Each cluster center is recomputed as the average of the points in that cluster.
  4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated, or that the changes do not make a material difference in the definition of the clusters.

Choosing the Number of Clusters

One of the main disadvantages to k-means clustering is the fact that you must specify the number of clusters as an input to the algorithm. As designed, the algorithm is not capable of determining the appropriate number of clusters and depends upon the user to identify this in advance.

For example, if you had a group of people that are to be clustered based upon binary gender identity as male or female, calling the k-means algorithm using the input k=3 would force the people into three clusters when only two, or an input of k=2, would provide a more natural fit.

Similarly, if a group of individuals were easily clustered based upon home state and you called the k-means algorithm with the input k=20, the results might be too generalized to be effective.

For this reason, it’s often a good idea to experiment with different values of k to identify the value that best suits your data. You also may wish to explore the use of other data mining algorithms in your quest for machine-learned knowledge.