K-means for Beginners: How to Build from Scratch in Python

In this article, you will learn how to implement k-means entirely from scratch and gain a strong understanding of the k-means algorithm.

Introduction

The K-means algorithm is a method for dividing a set of data points into distinct clusters, or groups, based on similar attributes. It is an unsupervised learning algorithm which means it does not require labeled data in order to find patterns in the dataset.

K-means is an approachable introduction to clustering for developers and data scientists interested in machine learning. In this article, you will learn how to implement k-means entirely from scratch and gain a strong understanding of the k-means algorithm.


What is Clustering?

The goal of clustering is to divide items into groups such that the objects in a group are more similar to one another than to those outside the group. Stepping away from programming for a moment can help build an intuition for what a cluster is.

Given this dataset, how would you group these items?

Just by visual inspection, you probably arrived at one of the two clusterings shown below. Some people would divide these six items along the lines of color. Others would divide the items by shape.

Two ways that you could cluster these six items

Both of these clusterings are valid and could be useful in specific scenarios. This is an important point when it comes to clustering. You can say that these six items fit into two clusters while someone else can come along and say they fit into three clusters. You are evaluating these items using whatever your chosen “similarity function” is. Other people can use a different “similarity function” and arrive at a different but still valid result.

In the same way, we can use whatever similarity function we want in k-means to compare two points.

How to Define Similarity in a Cluster?

The most common and straightforward way to compare the similarity of two data points is by using distance. In the same way you would use the Pythagorean theorem to find the length of a diagonal, we can calculate the distance between two data points.

A visualization of using the Euclidean Distance formula between Object 1 and Object 2

This method of comparing similarity is called Euclidean distance. Another commonly used similarity function in K-means clustering is called Manhattan distance.

Manhattan distance works by summing the absolute difference between the two points in each dimension

These two distance formulas are effective on a wide variety of datasets for clustering tasks. However, a given similarity function may not be effective on every distribution of data points. This poses the question, “What are the properties of a good similarity function?”.
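To make the two distance measures concrete, here is a minimal NumPy sketch (illustrative only; the points and values are made up) that computes both between a pair of 2-D points:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))   # 5.0

# Manhattan distance: sum of the absolute differences in each dimension
manhattan = np.sum(np.abs(a - b))           # 7.0

print(euclidean, manhattan)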

Characteristics of a Good Similarity Function

There are three basic properties that can be used to evaluate if a similarity function is good. Of course, there are more criteria that could be considered but these are among the most important. Below the notation d(x,y) is read as “the distance between x and y”.

Symmetry: d(x,y) = d(y,x)

The property of symmetry is important because otherwise we could say that A looks like B, but B looks nothing like A.

Positive Separability: d(x,y) = 0 if and only if x = y

The property of positive separability is important because otherwise there can be two different data points, A and B, that we can not tell apart.

Triangular Inequality: d(x,y) ≤ d(x,z) + d(z,y)

The triangular inequality is important because otherwise we could say A looks like B, and B looks like C, but A doesn't look like C.
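These properties are stated mathematically, but you can also spot-check them numerically. The sketch below (a sanity check on random points, not a proof) verifies that Euclidean distance behaves as a good similarity function:

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

x, y, z = np.random.rand(3), np.random.rand(3), np.random.rand(3)

print(np.isclose(euclidean(x, y), euclidean(y, x)))          # symmetry
print(euclidean(x, x) == 0.0)                                 # d(x, x) = 0
print(euclidean(x, y) <= euclidean(x, z) + euclidean(z, y))   # triangular inequality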

Overview of Common Clustering Methods

This article is focused on K-means clustering, but this is just one of many clustering algorithms that exist.

Partitional Clustering

K-means is an example of a partitional clustering algorithm (also known as centroid-based clustering). This means that it takes in a user supplied amount of clusters, k, and divides the data into that many partitions. In partitional clustering, each data point can only belong to a single cluster and no cluster can be empty.

Many partitional clustering algorithms are non-deterministic. This means that you could end up with different clusters each time you run the algorithm, even if you keep the inputs fixed.

Hierarchical Clustering

Algorithms under the umbrella of hierarchical clustering assign objects to clusters by building a hierarchy from either the top down or bottom up.

The top down approach is called Divisive clustering. It works by starting with all points in one cluster and then splitting the least similar clusters at each step until each data point is in a singleton cluster.

The bottom up approach is called Agglomerative clustering. This approach starts with each point in its own cluster and iteratively merges the two most similar clusters until only one big cluster remains.

Unlike the partitional clustering approaches, hierarchical clustering is deterministic. This means that cluster assignments will not vary between runs on the same dataset.

Both top down and bottom up approaches produce a tree-based hierarchy of points called a dendrogram. This dendrogram proves useful for selecting the number of clusters that you want to use. You can cut the tree at any depth to produce a subset of disjoint trees, each representing a cluster.
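If you want to see a dendrogram-style clustering in action without writing it yourself, SciPy ships an agglomerative implementation. The following sketch (assuming SciPy is installed; it is separate from the from-scratch K-means built later in this article) builds the hierarchy and cuts it into three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(20, 2)

# Build the agglomerative hierarchy (Ward linkage is one common choice)
Z = linkage(points, method='ward')

# Cut the dendrogram so that at most 3 disjoint clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)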

Comparing Partitional Clustering and Hierarchical Clustering

There are strengths and weaknesses to both methods of clustering. First, let’s take a look at the advantages and disadvantages of using K-means (partitional clustering).

Advantages of K-means:

  • Easy to implement
  • K-means may be faster than hierarchical clustering when there are many variables (if K is small)
  • K-means may produce tighter clusters than hierarchical clustering
  • An instance can change cluster (move to another cluster) when the centroids are recomputed

Disadvantages of K-means:

  • Difficult to predict the number of clusters (the K-value)
  • Initial seeds have a strong impact on the final results
  • The order of the data has an impact on the final results
  • Sensitive to scale: rescaling your dataset (normalization or standardization) will completely change the results

Now, let’s take a look at the strengths and weaknesses of hierarchical clustering.

Advantages of hierarchical clustering:

  • Outputs a hierarchy that is more informative than the unstructured set of clusters returned by k-means
  • Easy to implement

Disadvantages of hierarchical clustering:

  • Once a point has been assigned to a cluster, it can no longer be moved
  • Time complexity: not suitable for large datasets

I have given an overview of two families of clustering algorithms. There are other types of clustering algorithms that I did not cover such as Density-based clustering. For a more complete overview of clustering algorithms, visit this resource.

How does K-means Clustering work visually?

This visualization shows the k-means algorithm running with 3 clusters. First, the three centroids are initialized. These are the initial centers of each cluster and are represented by the large blue, red and green dots.

Next, the distance between a given data point and each of the three centroids is calculated. This is done for all the data points. Each point is assigned to the cluster of the centroid it is closest to. This happens when I click Reassign Points in the visualization.

Then, the centroids are recalculated by averaging the coordinates of each datapoint in the respective cluster. This happens when I click Update Centroids.

This process of reassigning points and updating centroids continues until the centroids no longer move.

You can use this visualization to gain an intuitive understanding of k-means yourself! Try it here.

What is the K-means Pseudocode?

The pseudocode for the k-means algorithm
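In plain terms, the algorithm boils down to the following steps:

initialize K centroids (for example, pick K random points from the dataset)
repeat until the centroids stop moving (or a maximum number of iterations is reached):
    assign each point to the cluster of its nearest centroid
    recompute each centroid as the mean of the points assigned to it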

How to write K-means from Scratch in Python?

Our k-means implementation will be divided into five helper methods and one main loop that runs the algorithm. Let's go through the functions one by one.

  • Pairwise Distance
  • Initialize Centers
  • Update Assignment
  • Update Centers
  • Calculate Loss
  • Main Loop
  • The Complete Implementation

Calculate Pairwise Distance

def pairwise_dist(self, x, y):
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist[i, j] is the euclidean distance between
            x[i, :] and y[j, :]
        """
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y, computed for all pairs at once
        xSumSquare = np.sum(np.square(x), axis=1)
        ySumSquare = np.sum(np.square(y), axis=1)
        mul = np.dot(x, y.T)
        dists = np.sqrt(abs(xSumSquare[:, np.newaxis] + ySumSquare - 2 * mul))
        return dists

The pairwise_dist function is the equivalent of the similarity function described earlier. It is the metric by which we compare the similarity of two points; here I am using Euclidean distance. The formula may look different from the usual formulation of the Euclidean distance function because we perform the operation matrix-wise over all pairs of points at once, using the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y, instead of on two single vectors. Read in depth about this here.
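Once the helpers are assembled into the KMeans class shown later in this article, you can see the N x M output shape for yourself. A small, made-up example:

x = np.array([[0.0, 0.0],
              [3.0, 4.0]])
y = np.array([[0.0, 0.0],
              [6.0, 8.0],
              [1.0, 0.0]])

dists = KMeans().pairwise_dist(x, y)
print(dists.shape)   # (2, 3): one row per point in x, one column per point in y
print(dists)
# [[ 0.     10.      1.    ]
#  [ 5.      5.      4.4721]]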

Initializing the Centers

def _init_centers(self, points, K, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            kwargs: any additional arguments you want
        Return:
            centers: K x D numpy array, the centers. 
        """
        row, col = points.shape
        retArr = np.empty([K, col])
        for number in range(K):
            randIndex = np.random.randint(row)
            retArr[number] = points[randIndex]
        
        return retArr

This function takes in the array of points and chooses K of them at random to serve as the initial centroids. The function simply returns the K selected points.
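One thing to be aware of: because np.random.randint can return the same row index more than once, two initial centroids may coincide. The implementation above keeps things simple, but if you want distinct starting points, a small variation (not the code used in this article; the method name is just for illustration) is to sample indices without replacement:

def _init_centers_distinct(self, points, K, **kwargs):
    # Variation on the initializer above: choose K *distinct* rows as starting centroids
    idx = np.random.choice(points.shape[0], size=K, replace=False)
    return points[idx].copy()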

Update Assignment

def _update_assignment(self, centers, points):
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            points: NxD numpy array, the observations
        Return:
            cluster_idx: numpy array of length N, the cluster assignment for each point

        Hint: You could call pairwise_dist() function.
        """
        row, col = points.shape
        cluster_idx = np.empty([row])
        distances = self.pairwise_dist(points, centers)
        cluster_idx = np.argmin(distances, axis=1)

        return cluster_idx

The update assignment function is responsible for choosing which cluster each point should belong to. First, I calculate the distance between every point and every centroid using the pairwise_dist function. Then, I take the index of the minimum distance in each row. The index of the minimum distance is also the index of the cluster assignment for the given data point, since we want to assign each point to its closest centroid.
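To see what np.argmin(distances, axis=1) is doing, consider a tiny, made-up distance matrix with 3 points (rows) and 2 centroids (columns):

distances = np.array([[0.5, 2.0],
                      [3.0, 1.0],
                      [4.0, 4.5]])

print(np.argmin(distances, axis=1))   # [0 1 0] -> index of the nearest centroid for each point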

Update Centers

def _update_centers(self, old_centers, cluster_idx, points):
        """
        Args:
            old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
        """
        K, D = old_centers.shape
        new_centers = np.empty(old_centers.shape)
        for i in range(K):
            new_centers[i] = np.mean(points[cluster_idx == i], axis = 0)
        return new_centers

The update centers function is responsible for averaging all the points that belong to a given cluster. This average is the new centroid of the respective cluster. The function returns the array of new centers.
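The boolean-mask averaging is the core trick here. A tiny example of what happens for one cluster:

points = np.array([[0.0, 0.0],
                   [2.0, 0.0],
                   [10.0, 10.0]])
cluster_idx = np.array([0, 0, 1])

# Rows assigned to cluster 0 are selected by the mask, then averaged column-wise
print(np.mean(points[cluster_idx == 0], axis=0))   # [1. 0.]

Note that if a cluster ever ends up with no points, np.mean over zero rows produces nan; with the random initialization used here that is uncommon but possible.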

Calculate Loss

def _get_loss(self, centers, cluster_idx, points):
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            loss: a single float number, which is the objective function of KMeans. 
        """
        dists = self.pairwise_dist(points, centers)
        loss = 0.0
        N, D = points.shape
        for i in range(N):
            loss = loss + np.square(dists[i][cluster_idx[i]])
        
        return loss

The loss function is the metric by which we evaluate the performance of our clustering algorithm. Our loss is simply the sum of the squared distances between each point and its cluster centroid. In our implementation, we first call pairwise_dist to get the distance matrix between every point and every center. Then, using cluster_idx, we select for each point the distance to its assigned centroid, square it, and add it to the running total.
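The per-point loop above is easy to read. If you prefer a loop-free version, the following sketch (an optional alternative, not the code used in this article) computes the same objective with fancy indexing:

def _get_loss_vectorized(self, centers, cluster_idx, points):
    dists = self.pairwise_dist(points, centers)
    # Pick each point's distance to its own centroid, square it, and sum
    return np.sum(np.square(dists[np.arange(points.shape[0]), cluster_idx]))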

Main Loop

def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations (Hint: You could change it when debugging)
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
            kwargs: any additional arguments you want
        Return:
            cluster assignments: Nx1 int numpy array
            cluster centers: K x D numpy array, the centers
            loss: final loss value of the objective function of KMeans
        """
        centers = self._init_centers(points, K, **kwargs)
        for it in range(max_iters):
            cluster_idx = self._update_assignment(centers, points)
            centers = self._update_centers(centers, cluster_idx, points)
            loss = self._get_loss(centers, cluster_idx, points)
            K = centers.shape[0]
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            if verbose:
                print('iter %d, loss: %.4f' % (it, loss))
        return cluster_idx, centers, loss

Now, in the main loop we combine all the utility functions and implement the pseudocode. First, the centers are randomly initialized with _init_centers. Then, for up to max_iters iterations, we repeat the update_assignment and update_centers steps. After each iteration, we calculate the total loss and compare it to the previous loss. If the change falls below our absolute and relative tolerances, the algorithm has converged and we stop early.

Complete K-Means Class Implementation

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm
# Load image
import imageio


# Set random seed so output is all same
np.random.seed(1)


class KMeans(object):

    def __init__(self):
        pass

    def pairwise_dist(self, x, y):
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist[i, j] is the euclidean distance between
            x[i, :] and y[j, :]
        """
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y, computed for all pairs at once
        xSumSquare = np.sum(np.square(x), axis=1)
        ySumSquare = np.sum(np.square(y), axis=1)
        mul = np.dot(x, y.T)
        dists = np.sqrt(abs(xSumSquare[:, np.newaxis] + ySumSquare - 2 * mul))
        return dists

    def _init_centers(self, points, K, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            kwargs: any additional arguments you want
        Return:
            centers: K x D numpy array, the centers. 
        """
        row, col = points.shape
        retArr = np.empty([K, col])
        for number in range(K):
            randIndex = np.random.randint(row)
            retArr[number] = points[randIndex]
        
        return retArr

    def _update_assignment(self, centers, points):
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            points: NxD numpy array, the observations
        Return:
            cluster_idx: numpy array of length N, the cluster assignment for each point

        Hint: You could call pairwise_dist() function.
        """
        row, col = points.shape
        cluster_idx = np.empty([row])
        distances = self.pairwise_dist(points, centers)
        cluster_idx = np.argmin(distances, axis=1)

        return cluster_idx

    def _update_centers(self, old_centers, cluster_idx, points):
        """
        Args:
            old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
        """
        K, D = old_centers.shape
        new_centers = np.empty(old_centers.shape)
        for i in range(K):
            new_centers[i] = np.mean(points[cluster_idx == i], axis = 0)
        return new_centers

    def _get_loss(self, centers, cluster_idx, points):
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            loss: a single float number, which is the objective function of KMeans. 
        """
        dists = self.pairwise_dist(points, centers)
        loss = 0.0
        N, D = points.shape
        for i in range(N):
            loss = loss + np.square(dists[i][cluster_idx[i]])
        
        return loss

    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations (Hint: You could change it when debugging)
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
            kwargs: any additional arguments you want
        Return:
            cluster assignments: Nx1 int numpy array
            cluster centers: K x D numpy array, the centers
            loss: final loss value of the objective function of KMeans
        """
        centers = self._init_centers(points, K, **kwargs)
        for it in range(max_iters):
            cluster_idx = self._update_assignment(centers, points)
            centers = self._update_centers(centers, cluster_idx, points)
            loss = self._get_loss(centers, cluster_idx, points)
            K = centers.shape[0]
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            if verbose:
                print('iter %d, loss: %.4f' % (it, loss))
        return cluster_idx, centers, loss

Here is our complete K-means class implementation in Python. I encourage you to copy this code (or implement it yourself!) and run it on your own machine. This is the best way to cement your understanding of the K-means algorithm.
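Before moving on to images, a quick sanity check on synthetic data is a good way to confirm everything works. A minimal sketch (the exact numbers will vary with the random seed):

# Two well-separated blobs of 2-D points
toy_points = np.concatenate([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
])

cluster_idx, centers, loss = KMeans()(toy_points, 2)
print(centers)   # the two centers should land roughly near (0, 0) and (5, 5)
print(loss)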

Let's take a look at how to interface with our K-means class through a concrete example.

Performing Image Segmentation using K-means algorithm

One great practical application of the K-means algorithm is image segmentation. This means grouping the pixels of an image into k clusters based on their color, reducing the image to a limited palette of colors. Luckily, we can achieve this with no modifications to our K-means class implementation and just a little bit more code.

The only utility function we need for this converts an input image into a matrix:

def image_to_matrix(image_file, grays=False):
    """
    Convert .png image to matrix
    of values.
    params:
    image_file = str
    grays = Boolean
    returns:
    img = (color) np.ndarray[np.ndarray[np.ndarray[float]]]
    or (grayscale) np.ndarray[np.ndarray[float]]
    """
    img = plt.imread(image_file)
    # in case of transparency values
    if len(img.shape) == 3 and img.shape[2] > 3:
        height, width, depth = img.shape
        new_img = np.zeros([height, width, 3])
        for r in range(height):
            for c in range(width):
                new_img[r, c, :] = img[r, c, 0:3]
        img = np.copy(new_img)
    if grays and len(img.shape) == 3:
        height, width = img.shape[0:2]
        new_img = np.zeros([height, width])
        for r in range(height):
            for c in range(width):
                new_img[r, c] = img[r, c, 0]
        img = new_img
    return img

Now, using this function, we are ready to do image segmentation!

image_values = image_to_matrix('./images/bird_color_24.png')

r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r*c,ch)

k = 6 # feel free to change this value
cluster_idx, centers, loss = KMeans()(image_values, k)
updated_image_values = np.copy(image_values)

# assign each pixel to cluster mean
for i in range(0,k):
    indices_current_cluster = np.where(cluster_idx == i)[0]
    updated_image_values[indices_current_cluster] = centers[i]
    
updated_image_values = updated_image_values.reshape(r,c,ch)

plt.figure(None,figsize=(9,12))
plt.imshow(updated_image_values)
plt.show()

First, I load in the image I want to use and convert it to a matrix. Replace the path with any image you want to use! Then, we flatten the image from an R x C x CH array down to an (R*C) x CH array, where R, C, and CH are the rows, columns, and channels. We instantiate our KMeans class with the first set of parentheses and invoke its __call__ method with the second set, passing in our data points (the pixel values) and the desired number of clusters.

Once the algorithm has run, the color of each pixel in the image is updated to the color of the centroid of its respective cluster. Then we output the image using matplotlib.

This image has been reduced down to just six colors using the K-means algorithm

Choosing the Proper Number of Clusters

With the K-means algorithm, performance can vary widely depending on the number of clusters that you use. I will provide an overview of two methods for determining the best number of clusters, k, for a given dataset.

Elbow Method

In order to use the elbow method, you simply need to run your K-means algorithm multiple times, increasing the number of clusters by one with each run. Record the loss for each run and then plot the number of clusters against the loss.

Here is a simple implementation of the elbow method:

def find_optimal_num_clusters(self, data, max_K=15):
        """
        Plots loss values for different number of clusters in K-Means

        Args:
            data: NxD numpy array, the dataset to cluster
            max_K: maximum number of clusters to try
        Return:
            y_val: array of loss values, one for each choice of K (also plotted)
        """
        y_val = np.empty(max_K)

        for i in range(max_K):
            cluster_idx, centers, y_val[i] = KMeans()(data, i + 1)
            
        plt.plot(np.arange(max_K) + 1, y_val)
        plt.show()
        return y_val

Running this method will output a plot that looks something like what you see below:

Now, in order to choose the correct number of clusters, we perform visual inspection. The spot where the loss curve starts to bend is known as the elbow point. The elbow point represents a reasonable trade-off between error and the number of clusters. In this example, the elbow point is at x = 3, which means the optimal number of clusters is 3.

Silhouette Coefficient

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighboring cluster.

Silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

I recommend using scikit-learn if you want to implement Silhouette Coefficient for cluster analysis. Visit this resource for a complete guide on implementation.
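If you just want the number itself, scikit-learn exposes the average silhouette directly. A minimal sketch (assuming scikit-learn is installed; data and cluster_idx stand for whatever dataset and assignments you just clustered, and sample_size keeps the computation manageable on large datasets):

from sklearn.metrics import silhouette_score

score = silhouette_score(data, cluster_idx, sample_size=10000, random_state=1)
print(score)   # closer to +1 is better; values near 0 or negative suggest a poor choice of k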

Test Your Understanding

Ponder these questions to check that you have a solid understanding of the K-means algorithm. Also check out my list of 8 Unique Machine Learning Interview Questions on K-means.

Will different initialization of the K-means algorithm lead to different clusters?

  • Yes
  • No
  • Sometimes

Visit this link once you think you have the answer for a brief summary.

Will the K-means algorithm always stop after a finite number of iterations?

  • Yes
  • No
  • Sometimes

Visit this link once you think you have the answer for a brief summary.

Conclusion

In this article we have covered a lot about clustering algorithms and the Python implementation of K-means. Now that you understand K-means and how to implement the algorithm, you can work on applying these skills practically.

Try to run the K-means class implementation on different types of datasets. In this article we experimented with image data, but you can use all sorts of data with K-means. If you are struggling with ideas, this resource has some great unique ideas for applications of K-means that you can try to implement.

Avi Arora

Avi is a Computer Science student at the Georgia Institute of Technology pursuing a Master's in Machine Learning. He is a software engineer at Capital One and the co-founder of the company Octtone. His company creates software products in the Health & Wellness space.