Introduction
The K-means algorithm is a method for dividing a set of data points into distinct clusters, or groups, based on similar attributes. It is an unsupervised learning algorithm, which means it does not require labeled data in order to find patterns in a dataset.
K-means is an approachable introduction to clustering for developers and data scientists interested in machine learning. In this article, you will learn how to implement k-means entirely from scratch and gain a strong understanding of the k-means algorithm.
Article Overview
- What is Clustering?
- How does K-means Clustering work visually?
- What is the K-means Pseudocode?
- How to write K-means from Scratch in Python?
- Image Segmentation with K-means algorithm
- Choosing the Proper Number of Clusters
- Test Your Understanding
- Conclusion
What is Clustering?
The goal of clustering is to divide items into groups such that objects in a group are more similar to each other than to objects outside the group. Stepping away from programming for a second, we can gain a better understanding of clusters.
Given this dataset, how would you group these items?
Just by visual inspection, you probably arrived at one of the two clusterings shown below. Some people would divide these six items by color. Others would divide them by shape.
Both of these clusterings are valid and could be useful in specific scenarios. This is an important point when it comes to clustering. You can say that these six items fit into two clusters while someone else can come along and say they fit into three clusters. You are evaluating these items using whatever your chosen “similarity function” is. Other people can use a different “similarity function” and arrive at a different but still valid result.
In the same way, we can use whatever similarity function we want in k-means to compare two points.
How to Define Similarity in a Cluster?
The most common and straightforward way to compare the similarity of two data points is by using distance. In the same way you would use the Pythagorean theorem to find the length of the diagonal, we can calculate the distance between two data points.
This method of comparing similarity is called Euclidean distance. Another commonly used similarity function in K-means clustering is called Manhattan distance.
These two distance formulas are effective on a wide variety of datasets for clustering tasks. However, a given similarity function may not be effective on every distribution of data points. This poses the question, “What are the properties of a good similarity function?”.
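For a concrete sense of the difference, here is a minimal sketch of both distance functions applied to a pair of two-dimensional points using NumPy (the points themselves are arbitrary):
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: the straight-line distance between the two points
euclidean = np.sqrt(np.sum((a - b) ** 2))  # 5.0

# Manhattan distance: the sum of the absolute differences along each axis
manhattan = np.sum(np.abs(a - b))  # 7.0

print(euclidean, manhattan)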
Characteristics of a Good Similarity Function
There are three basic properties that can be used to evaluate if a similarity function is good. Of course, there are more criteria that could be considered but these are among the most important. Below the notation d(x,y) is read as “the distance between x and y”.
Symmetry: d(x,y) = d(y,x)
The property of symmetry is important because otherwise we could say that A looks like B, but B looks nothing like A.
Positive Separability: d(x,y) = 0 if and only if x = y
The property of positive separability is important because otherwise there could be two different data points, A and B, that we cannot tell apart.
Triangular Inequality: d(x,y) ≤ d(x,z) + d(z,y)
The triangular inequality is important because otherwise we could say A looks like B, and B looks like C, but A doesn’t look like C.
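As a quick sanity check, the short sketch below verifies all three properties numerically for the Euclidean distance on a few arbitrary sample points:
import numpy as np

def d(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 1.0])

print(d(x, y) == d(y, x))            # symmetry
print(d(x, x) == 0 and d(x, y) > 0)  # positive separability (for these points)
print(d(x, y) <= d(x, z) + d(z, y))  # triangular inequality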
Overview of Common Clustering Methods
This article is focused on K-means clustering, but this is just one of many clustering algorithms that exist.
Partitional Clustering
K-means is an example of a partitional clustering algorithm (also known as centroid-based clustering). This means that it takes in a user-supplied number of clusters, k, and divides the data into that many partitions. In partitional clustering, each data point can only belong to a single cluster and no cluster can be empty.
Many partitional clustering algorithms are non-deterministic. This means that you could end up with different clusters each time you run the algorithm, even if you keep the inputs fixed.
Hierarchical Clustering
Algorithms under the umbrella of hierarchical clustering assign objects to clusters by building a hierarchy from either the top down or bottom up.
The top down approach is called Divisive clustering. It works by starting with all points in one cluster and then splitting the least similar clusters at each step until each data point is in a singleton cluster.
The bottom up approach is called Agglomerative clustering. This approach starts with each point in its own cluster and iteratively merges the two most similar clusters until only one big cluster remains.
Unlike the partitional clustering approaches, hierarchical clustering is deterministic. This means that cluster assignments will not vary between runs on the same dataset.
Both top down and bottom up approaches produce a tree-based hierarchy of points called a dendrogram. This dendrogram proves useful for selecting the number of clusters that you want to use. You can cut the tree at any depth to produce a subset of disjoint trees, each representing a cluster.
Comparing Partitional Clustering and Hierarchical Clustering
There are strengths and weaknesses to both methods of clustering. First, let’s take a look at the advantages and disadvantages of using K-means (partitional clustering).
| Advantages of K-means | Disadvantages of K-means |
| --- | --- |
| Easy to implement | Difficult to predict the number of clusters (K-value) |
| k-Means may be faster than hierarchical clustering with a large number of variables (if K is small) | Initial seeds have a strong impact on the final results |
| k-Means may produce tighter clusters than hierarchical clustering | The order of the data has an impact on the final results |
| An instance can change cluster (move to another cluster) when the centroids are recomputed | Sensitive to scale: rescaling your dataset (normalization or standardization) will completely change the results |
Now, let’s take a look at the strengths and weaknesses of hierarchical clustering.
| Advantages of hierarchical clustering | Disadvantages of hierarchical clustering |
| --- | --- |
| Outputs a hierarchy that is more informative than the unstructured set of clusters returned by k-means | Once a point has been assigned to a cluster, it can no longer be moved around |
| Easy to implement | Time complexity: not suitable for large datasets |
I have given an overview of two families of clustering algorithms. There are other types of clustering algorithms that I did not cover such as Density-based clustering. For a more complete overview of clustering algorithms, visit this resource.
How does K-means Clustering work visually?
This visualization shows the k-means algorithm running with 3 clusters. First, the three centroids are initialized. These are the initial centers of each cluster and are represented by the large blue, red and green dots.
Next, the distance between a given data point and each of the three centroids is calculated. This is done for all the data points. Each point is assigned to the cluster of the centroid it is closest to. This happens when I click Reassign Points in the visualization.
Then, the centroids are recalculated by averaging the coordinates of each datapoint in the respective cluster. This happens when I click Update Centroids.
This process of reassigning points and updating centroids continues until the centroids no longer move.
You can use this visualization to gain an intuitive understanding of k-means yourself! Try it here.
What is the K-means Pseudocode?
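At a high level, the algorithm we just walked through visually can be summarized in a few steps:
- Choose the number of clusters, K, and initialize K centroids (for example, by picking K data points at random).
- Assign every data point to the cluster of its closest centroid, using the chosen similarity function.
- Recompute each centroid as the average of all the points assigned to its cluster.
- Repeat the assignment and update steps until the centroids stop moving (or a maximum number of iterations is reached).
These are exactly the steps we will implement below.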
How to write K-means from Scratch in Python?
Our k-means implementation will be divided into five helper methods and one main loop that runs the algorithm. Let’s go through the functions one by one.
- Pairwise Distance
- Initialize Centers
- Update Assignment
- Update Centers
- Calculate Loss
- Main Loop
- The Complete Implementation
Calculate Pairwise Distance
def pairwise_dist(self, x, y):
"""
Args:
x: N x D numpy array
y: M x D numpy array
Return:
dist: N x M array, where dist2[i, j] is the euclidean distance between
x[i, :] and y[j, :]
"""
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y, computed for every pair at once
    xSumSquare = np.sum(np.square(x), axis=1)
    ySumSquare = np.sum(np.square(y), axis=1)
    mul = np.dot(x, y.T)
    # abs() guards against small negative values introduced by floating-point error
    dists = np.sqrt(np.abs(xSumSquare[:, np.newaxis] + ySumSquare - 2 * mul))
return dists
The pairwise_dist function is the equivalent of the similarity function described earlier. This is the metric by which we compare the similarity of two points. Here, I am using the Euclidean distance. The formula I am using may look different from the regular formulation of the Euclidean distance function. This is because we are performing the operation on entire matrices at once instead of on two single vectors. Read in depth about this here.
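To convince yourself that the vectorized formula matches the ordinary definition, here is a small sanity check comparing pairwise_dist against a plain double loop on random data (this assumes the full KMeans class shown later in this article):
import numpy as np

kmeans = KMeans()
x = np.random.rand(4, 2)
y = np.random.rand(3, 2)
dists = kmeans.pairwise_dist(x, y)  # shape (4, 3)

# Check every entry against the textbook Euclidean distance
for i in range(4):
    for j in range(3):
        assert np.isclose(dists[i, j], np.sqrt(np.sum((x[i] - y[j]) ** 2)))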
Initializing the Centers
def _init_centers(self, points, K, **kwargs):
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
kwargs: any additional arguments you want
Return:
centers: K x D numpy array, the centers.
"""
row, col = points.shape
retArr = np.empty([K, col])
for number in range(K):
randIndex = np.random.randint(row)
retArr[number] = points[randIndex]
return retArr
This function takes in the array of points and chooses K of them at random to be the initial centroids. The function simply returns the K selected points.
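Note that because each index is drawn independently, the same point could in principle be chosen as a centroid more than once. If you want to rule that out, one small variation (not part of the original implementation) is to sample the indices without replacement:
def _init_centers(self, points, K, **kwargs):
    # Choose K distinct row indices so no point is selected twice
    rand_indices = np.random.choice(points.shape[0], size=K, replace=False)
    return points[rand_indices]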
Update Assignment
def _update_assignment(self, centers, points):
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
points: NxD numpy array, the observations
Return:
cluster_idx: numpy array of length N, the cluster assignment for each point
Hint: You could call pairwise_dist() function.
"""
row, col = points.shape
cluster_idx = np.empty([row])
distances = self.pairwise_dist(points, centers)
cluster_idx = np.argmin(distances, axis=1)
return cluster_idx
The update assignment function is responsible for choosing which cluster each point should belong to. First, I calculate the distance between every point and every centroid using the pairwise_dist function. Then, I get the index of the minimum distance for each row. The index of the minimum distance is also the index of the cluster assignment for the given data point, since we want to assign each point to the closest centroid.
Update Centers
def _update_centers(self, old_centers, cluster_idx, points):
"""
Args:
old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
"""
K, D = old_centers.shape
new_centers = np.empty(old_centers.shape)
for i in range(K):
new_centers[i] = np.mean(points[cluster_idx == i], axis = 0)
return new_centers
The update centers function is responsible for averaging all the points that belong to a given cluster. This average is the new centroid of the respective cluster. The function returns the array of new centers.
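One subtle edge case worth mentioning: if a cluster ends up with no points assigned to it, np.mean over an empty selection returns NaN (and NumPy prints a warning). A simple guard, which is not part of the original implementation, would be to keep the previous centroid for an empty cluster. The loop inside _update_centers could be written as:
for i in range(K):
    members = points[cluster_idx == i]
    if len(members) == 0:
        # Empty cluster: keep the old centroid instead of producing NaN
        new_centers[i] = old_centers[i]
    else:
        new_centers[i] = np.mean(members, axis=0)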
Calculate Loss
def _get_loss(self, centers, cluster_idx, points):
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
loss: a single float number, which is the objective function of KMeans.
"""
dists = self.pairwise_dist(points, centers)
loss = 0.0
N, D = points.shape
for i in range(N):
loss = loss + np.square(dists[i][cluster_idx[i]])
return loss
The loss function is the metric by which we evaluate the performance of our clustering algorithm. Our loss is simply the sum of the squared distances between each point and its cluster centroid. In our implementation, we first call pairwise_dist to get the distance matrix between every point and every center. Then, for each point, we select the distance that corresponds to its assigned cluster using cluster_idx, square it, and add it to the running total.
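The Python loop above works fine, but the same objective can also be computed in a fully vectorized way. Here is an equivalent alternative sketch of _get_loss using NumPy fancy indexing:
def _get_loss(self, centers, cluster_idx, points):
    # Distance from every point to every center
    dists = self.pairwise_dist(points, centers)
    N = points.shape[0]
    # For each point, pick the distance to its assigned centroid, square it, and sum
    return np.sum(np.square(dists[np.arange(N), cluster_idx]))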
Main Loop
def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
max_iters: maximum number of iterations (Hint: You could change it when debugging)
abs_tol: convergence criteria w.r.t absolute change of loss
rel_tol: convergence criteria w.r.t relative change of loss
verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
kwargs: any additional arguments you want
Return:
cluster assignments: Nx1 int numpy array
cluster centers: K x D numpy array, the centers
loss: final loss value of the objective function of KMeans
"""
centers = self._init_centers(points, K, **kwargs)
for it in range(max_iters):
cluster_idx = self._update_assignment(centers, points)
centers = self._update_centers(centers, cluster_idx, points)
loss = self._get_loss(centers, cluster_idx, points)
K = centers.shape[0]
if it:
diff = np.abs(prev_loss - loss)
if diff < abs_tol and diff / prev_loss < rel_tol:
break
prev_loss = loss
if verbose:
print('iter %d, loss: %.4f' % (it, loss))
return cluster_idx, centers, loss
Now, in the main loop, we can combine all the utility functions and implement the pseudocode. First, the centers are randomly initialized with _init_centers. Then, for at most the specified number of iterations, we repeat the update_assignment and update_centers steps. After each iteration, we calculate the total loss and compare it to the previous loss. If the change is smaller than both the absolute and relative tolerance thresholds, the algorithm has converged and execution stops early.
Complete K-Means Class Implementation
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm
# Load image
import imageio
# Set random seed so results are reproducible between runs
np.random.seed(1)
class KMeans(object):
    def __init__(self):
pass
    def pairwise_dist(self, x, y):
"""
Args:
x: N x D numpy array
y: M x D numpy array
Return:
dist: N x M array, where dist2[i, j] is the euclidean distance between
x[i, :] and y[j, :]
"""
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y, computed for every pair at once
        xSumSquare = np.sum(np.square(x), axis=1)
        ySumSquare = np.sum(np.square(y), axis=1)
        mul = np.dot(x, y.T)
        # abs() guards against small negative values introduced by floating-point error
        dists = np.sqrt(np.abs(xSumSquare[:, np.newaxis] + ySumSquare - 2 * mul))
return dists
    def _init_centers(self, points, K, **kwargs):
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
kwargs: any additional arguments you want
Return:
centers: K x D numpy array, the centers.
"""
row, col = points.shape
retArr = np.empty([K, col])
for number in range(K):
randIndex = np.random.randint(row)
retArr[number] = points[randIndex]
return retArr
    def _update_assignment(self, centers, points):
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
points: NxD numpy array, the observations
Return:
cluster_idx: numpy array of length N, the cluster assignment for each point
Hint: You could call pairwise_dist() function.
"""
row, col = points.shape
cluster_idx = np.empty([row])
distances = self.pairwise_dist(points, centers)
cluster_idx = np.argmin(distances, axis=1)
return cluster_idx
    def _update_centers(self, old_centers, cluster_idx, points):
"""
Args:
old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
"""
K, D = old_centers.shape
new_centers = np.empty(old_centers.shape)
for i in range(K):
new_centers[i] = np.mean(points[cluster_idx == i], axis = 0)
return new_centers
    def _get_loss(self, centers, cluster_idx, points):
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
loss: a single float number, which is the objective function of KMeans.
"""
dists = self.pairwise_dist(points, centers)
loss = 0.0
N, D = points.shape
for i in range(N):
loss = loss + np.square(dists[i][cluster_idx[i]])
return loss
def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
max_iters: maximum number of iterations (Hint: You could change it when debugging)
abs_tol: convergence criteria w.r.t absolute change of loss
rel_tol: convergence criteria w.r.t relative change of loss
verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
kwargs: any additional arguments you want
Return:
cluster assignments: Nx1 int numpy array
cluster centers: K x D numpy array, the centers
loss: final loss value of the objective function of KMeans
"""
centers = self._init_centers(points, K, **kwargs)
for it in range(max_iters):
cluster_idx = self._update_assignment(centers, points)
centers = self._update_centers(centers, cluster_idx, points)
loss = self._get_loss(centers, cluster_idx, points)
K = centers.shape[0]
if it:
diff = np.abs(prev_loss - loss)
if diff < abs_tol and diff / prev_loss < rel_tol:
break
prev_loss = loss
if verbose:
print('iter %d, loss: %.4f' % (it, loss))
return cluster_idx, centers, loss
Here is our complete K-means class implementation in Python. I encourage you to copy this code (or implement it yourself!) and run it on your own machine. This is the best way to cement your understanding of the K-means algorithm.
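Before moving on to a real application, here is a minimal sketch of running the class on some synthetic 2D data (the blob data below is made up purely for illustration):
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Three Gaussian blobs centered at different locations
data = np.vstack([
    np.random.randn(100, 2) + [0, 0],
    np.random.randn(100, 2) + [5, 5],
    np.random.randn(100, 2) + [0, 5],
])

cluster_idx, centers, loss = KMeans()(data, 3)
print('final loss:', loss)

plt.scatter(data[:, 0], data[:, 1], c=cluster_idx, s=10)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=100)
plt.show()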
Now, let’s take a look at how to use our K-means class on a more practical example.
Performing Image Segmentation using the K-means Algorithm
One great practical application of the K-means algorithm is image segmentation. This means grouping the pixels of an image into k clusters based on their color, thus reducing the image to a limited palette of colors. Luckily, we can achieve this with no modifications to our K-means class implementation and just a little bit more code.
The only utility function we need for this converts an input image into a matrix:
def image_to_matrix(image_file, grays=False):
"""
Convert .png image to matrix
of values.
params:
image_file = str
grays = Boolean
returns:
img = (color) np.ndarray[np.ndarray[np.ndarray[float]]]
or (grayscale) np.ndarray[np.ndarray[float]]
"""
img = plt.imread(image_file)
# in case of transparency values
if len(img.shape) == 3 and img.shape[2] > 3:
height, width, depth = img.shape
new_img = np.zeros([height, width, 3])
for r in range(height):
for c in range(width):
new_img[r, c, :] = img[r, c, 0:3]
img = np.copy(new_img)
if grays and len(img.shape) == 3:
height, width = img.shape[0:2]
new_img = np.zeros([height, width])
for r in range(height):
for c in range(width):
new_img[r, c] = img[r, c, 0]
img = new_img
return img
Now, using this function, we are ready to do image segmentation!
image_values = image_to_matrix('./images/bird_color_24.png')
r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r*c,ch)
k = 6 # feel free to change this value
cluster_idx, centers, loss = KMeans()(image_values, k)
updated_image_values = np.copy(image_values)
# assign each pixel to cluster mean
for i in range(0,k):
indices_current_cluster = np.where(cluster_idx == i)[0]
updated_image_values[indices_current_cluster] = centers[i]
updated_image_values = updated_image_values.reshape(r,c,ch)
plt.figure(None,figsize=(9,12))
plt.imshow(updated_image_values)
plt.show()
First, I load in the image I want to use and convert it to a matrix. Replace the path with any image you want to use! Then, we flatten the image from an R x C x CH array down to an (R*C) x CH array, where R, C, and CH are the rows, columns, and channels. We initialize our KMeans class with the first set of parentheses, and then invoke the __call__ method with the second set of parentheses. We pass in our data points (the image values) and the desired number of clusters.
Once the algorithm has run, the color of each pixel in the image is updated to the color of the centroid of its respective cluster. Then we output the image using matplotlib.
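If you also want to write the segmented image to disk, one option is the imageio package that is already imported at the top of the class implementation. Assuming the pixel values are floats in [0, 1] (which is what plt.imread returns for PNG files), you would scale them back to 8-bit integers first; the filename here is just an example:
# Convert float pixel values in [0, 1] to 8-bit integers and save the result
imageio.imwrite('segmented_bird.png', (updated_image_values * 255).astype(np.uint8))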
Choosing the Proper Number of Clusters
With the K-means algorithm, performance can vary widely depending on the number of clusters that you use. I will provide an overview of two methods for determining the best number of clusters, k, for a given dataset.
Elbow Method
In order to use the elbow method, you simply need to run your K-means algorithm multiple times, increasing the number of clusters by one each time. Record the loss for each run and then make a line graph of the number of clusters versus the loss.
Here is a simple implementation of the elbow method:
def find_optimal_num_clusters(self, data, max_K=15):
"""
Plots loss values for different number of clusters in K-Means
Args:
        data: NxD numpy array, the observations to cluster
        max_K: maximum number of clusters to try
    Return:
        y_val: numpy array of loss values, one for each choice of K (also plotted against the number of clusters)
"""
y_val = np.empty(max_K)
for i in range(max_K):
cluster_idx, centers, y_val[i] = KMeans()(data, i + 1)
plt.plot(np.arange(max_K) + 1, y_val)
plt.show()
return y_val
Running this method will output a plot that looks something like what you see below:
Now, in order to choose the correct number of clusters, we perform a visual inspection. The spot where the loss curve starts to bend is known as the elbow point. The elbow point represents a reasonable trade-off between error and the number of clusters. In this example, the elbow point is at x = 3, meaning the optimal number of clusters is 3.
Silhouette Coefficient
The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to the data within its cluster and how loosely it is matched to the data of the neighboring cluster.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
I recommend using scikit-learn if you want to implement Silhouette Coefficient for cluster analysis. Visit this resource for a complete guide on implementation.
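As a rough sketch of what that looks like in practice, you can feed the cluster assignments produced by our KMeans class into scikit-learn's silhouette_score and keep the K with the highest average silhouette. This assumes scikit-learn is installed and that data is the NxD array you are clustering:
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 10):  # the silhouette is only defined for two or more clusters
    cluster_idx, centers, loss = KMeans()(data, k)
    score = silhouette_score(data, cluster_idx)
    if score > best_score:
        best_k, best_score = k, score
print('best K by silhouette score:', best_k)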
Test Your Understanding
Ponder these questions to check that you have a solid understanding of the K-means algorithm. Also check out my list of 8 Unique Machine Learning Interview Questions on K-means.
Will different initialization of the K-means algorithm lead to different clusters?
- Yes
- No
- Sometimes
Visit this link once you think you have the answer for a brief summary.
Will the K-means algorithm always stop after a finite number of iterations?
- Yes
- No
- Sometimes
Visit this link once you think you have the answer for a brief summary.
Conclusion
In this article, we have covered a lot about clustering algorithms and walked through a Python implementation of K-means. Now that you have an understanding of K-means and how to implement the algorithm, you can work on applying these skills practically.
Try to run the K-means class implementation on different types of datasets. In this article we experimented with image data, but you can use all sorts of data with K-means. If you are struggling with ideas, this resource has some great unique ideas for applications of K-means that you can try to implement.