Other Topics — Machine Learning Interview Questions
Introduction
Not interested in background on K-means clustering? Skip ahead to the questions section below.
In preparing for your next Machine Learning interview, one of the topics you certainly need to be familiar with is K-means. This algorithm is incredibly useful for clustering data points into groups that have not been explicitly labeled! Machine Learning is now firmly embedded across industry, so it is important to have a solid grasp of concepts such as this one.
Check out my in-depth tutorial on implementing K-means from scratch in Python if you are not familiar with the algorithm!
This series of articles is meant to equip you with the knowledge you need to ace your ML interview and secure a top-tier job in the field. Once you decide to join the hunt to master this craft, companies will come after you!
Article Overview
- What is K-means clustering?
- How does K-means clustering work?
- Why is K-means clustering important?
- K-means ML Interview Questions and Answers
- Wrap Up
What is K-means clustering?
It is a very popular unsupervised clustering algorithm. It does not require labels to train the model, it only needs data.
Its main objective is to cluster data points that have similar properties into a certain number of groups (k groups) in order to discover underlying structures and patterns in the dataset. The name K-means comes from the fact that it clusters the data into k groups, where k is supplied to the algorithm. Here, k is a hyperparameter, and its optimal value needs to be found by trying different values of k.
How does K-means clustering work?
It computes a centroid for each of the k clusters on the dataset and refines these centroids over successive iterations until each one reaches an optimal position. After learning the optimal centroids, the algorithm uses them to assign new data points to the cluster whose members they most resemble.
Why is K-means clustering important?
Clustering is a very important technique for studying data and making use of unlabeled data. It also reduces human bias: for datasets containing millions of data points, we cannot manually capture every dependency, and in such scenarios clustering comes in handy. You may confuse K-means with K-nearest neighbors due to the similarity of their names, but K-nearest neighbors is a supervised learning algorithm, which is one of their biggest differences.
K-means ML Interview Questions and Answers
Some potential interview questions on K means clustering that you may encounter are shown below. Try to answer them in your head before clicking the arrow to reveal the answer.
The K-means algorithm clusters similar data points using the centroid of each cluster. New data points are classified using these centroids: whichever centroid a point is closest to determines the cluster it belongs to.
The goal of K-means is to organize our data into k distinct groups. If, for instance, we chose k = 2, then what we want K-means to accomplish is to separate the data into 2 groups.
- The first step is to pick k random points (here, 2) and assign each a different label. These points are called centroids.
- Next, we find the distance between centroid 1 and every other data point.
- We repeat the previous step for centroid 2 as well. Whichever centroid is closest to a given point, that point takes on that centroid's label.
- Then we update each centroid to be the average of all the points in its cluster.
- We repeat steps 2-4 over and over again until the centroid locations no longer change after an update. At that point, K-means has converged.
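The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not production code: it picks random data points as the initial centroids and does not handle the edge case of a cluster becoming empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign each point the label of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the average of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged once an update no longer moves the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On well-separated data this converges in a handful of iterations; real implementations add smarter initialization (k-means++) and multiple restarts.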
K-means assumes the clusters are convex (roughly spherical) and that the variables have symmetric distributions; it does not work well with skewed datasets. It also implicitly assumes the clusters have similar variance, so the features should be on comparable scales before it is applied.
K-means follows an unsupervised machine learning approach, as it does not require labels alongside the data to train the model.
Here, k refers to the number of clusters into which the data points will be grouped. K is a hyperparameter and is specific to every dataset, and the algorithm's performance depends on its value: at the optimal k it performs best, while for other values its performance fluctuates between best and worst depending on the properties of the dataset.
We can use the Dunn Index and inertia to evaluate the performance of the K-means algorithm.
It is a simple approach and is guaranteed to converge (to a local minimum). It works well with big datasets, easily adapts to new examples, and handles data distributed in spherical clusters well. Its results are also easy to interpret.
Its biggest con is finding the optimal value of k. The value of k can only be determined manually, by trial and error or by searching over a range of candidate values, which can be time-consuming.
Another issue is the curse of dimensionality: as the number of dimensions increases, it becomes harder to cluster data. K-means cannot separate overlapping data points efficiently, and it is also sensitive to outliers and noisy data.
Additionally, K-means cannot accommodate clusters that are elliptical or otherwise complex in shape, and it struggles with datasets whose clusters have different sizes, since cluster variance is ignored entirely.
It is used in customer segmentation to cluster customers on the basis of past records, for example in banking, telecommunications, sales, advertising, etc.
Additionally, it can be used to build recommendation systems that suggest personalized services and content to users.
K-means is also used for image segmentation tasks, spam filtering, fraud detection, search engines, and document clustering.
The Dunn Index is a measure of clustering quality, and a greater value is preferable. It is calculated as the smallest intercluster distance (that is, the shortest distance between any two cluster centroids) divided by the greatest intracluster distance (i.e. the greatest possible distance between any two points within a single cluster).
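The definition above translates directly into code. This is a minimal sketch using the centroid-based intercluster distance exactly as defined here (other formulations of the Dunn Index exist); `clusters` is a hypothetical list of per-cluster point arrays.

```python
import numpy as np
from itertools import combinations

def dunn_index(clusters):
    """Smallest distance between any two cluster centroids, divided by the
    largest distance between any two points inside the same cluster."""
    centroids = [c.mean(axis=0) for c in clusters]
    min_inter = min(np.linalg.norm(a - b) for a, b in combinations(centroids, 2))
    max_intra = max(np.linalg.norm(p - q)
                    for c in clusters
                    for p, q in combinations(c, 2))
    return min_inter / max_intra

# Tight, well-separated clusters give a large (good) Dunn Index:
tight = [np.array([[0.0, 0.0], [0.0, 1.0]]),
         np.array([[10.0, 0.0], [10.0, 1.0]])]
print(dunn_index(tight))  # 10.0: centroids are 10 apart, widest cluster spans 1
```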
- Elbow method: This involves plotting the inertia against each candidate value of k. As a rule of thumb, the best k is where adding another cluster no longer yields a large drop in inertia; you are on the lookout for where the curve becomes less steep. In the elbow plot below, the curve became less steep after 5 clusters, so the best value for k is 5.
- Silhouette method: This takes into account the ratio of the inter-cluster and intra-cluster distances. A silhouette curve can be drawn just like the elbow curve, except that this time the curve is inverted, and the best k is the point at which the curve becomes very steep.
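The inertia that the elbow method plots is simple to compute directly: it is the sum of squared distances from each point to its nearest centroid. A minimal numpy sketch, where `X` and the two candidate centroid arrays are toy values chosen for illustration:

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

# Toy data: two tight groups, around (0, 0) and (10, 10).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = np.array([[0.0, 0.5], [10.0, 10.5]])  # one centroid per group
bad = np.array([[5.0, 5.0], [5.0, 6.0]])     # both centroids in the gap
print(inertia(X, good) < inertia(X, bad))    # True: a better fit has lower inertia
```

Running this for k = 1, 2, 3, … and plotting the resulting inertias produces the elbow curve described above.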
- K-medians: here we use the Manhattan distance instead of the Euclidean distance, and each cluster center is the median of its points rather than the mean.
- K-modes: this variant is useful when the data is categorical rather than numerical. It measures distance by counting the dissimilarities between examples, and uses the mode (most frequent value) of each attribute within a cluster as that cluster's center.
If we use K-means on this data, it will split it down the middle, whereas the natural grouping is for the outer circle to be one category and the inner circle another. An algorithm that helps solve this problem is agglomerative clustering.
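This failure mode is easy to reproduce with scikit-learn (assuming it is installed). Note the single-linkage choice here is an assumption on my part: single linkage merges chains of nearby points and so can follow each ring, whereas agglomerative clustering with the default ward linkage would, like K-means, cut across the circles.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: the classic dataset where K-means fails.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.04, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Adjusted Rand Index vs. the true rings (1.0 = perfect recovery).
print(adjusted_rand_score(y, km_labels))   # low: K-means cuts straight across
print(adjusted_rand_score(y, agg_labels))  # high: each ring becomes its own cluster
```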
The k-means++ algorithm addresses the problem of poor initialization, and the resulting bad approximation with respect to the objective function, by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. It is used for selecting good initial positions for the cluster centroids, not for choosing the value of k itself.
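A minimal sketch of the k-means++ seeding idea (illustrative only; production implementations such as scikit-learn's add further refinements, and `rng` here is assumed to be a numpy random Generator):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: the first center is drawn uniformly at random;
    each further center is drawn with probability proportional to its squared
    distance to the nearest center already chosen, spreading the seeds out."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The returned centers then replace the purely random centroids in step 1 of the standard iteration.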
k-Nearest Neighbors is a classification (or regression) algorithm. The basic idea is to label an unlabeled example according to the labels of its nearest labeled neighbors. It is a supervised learning algorithm. On the other hand, K-means is an unsupervised clustering approach that attempts to split a set of points into k groups so that the points in each cluster are close to each other.
So the major difference is that K-means clustering is an unsupervised learning algorithm, while K-nearest neighbors is a supervised machine learning algorithm.
- The Manhattan distance between two points is the sum of the pairwise absolute differences between each variable.
- The Euclidean distance between two points is the square root of the sum of the squared differences in each variable.
Manhattan distance is typically used when you wish to calculate the distance between two points in grids. For instance, a city, where houses are in blocks. Euclidean distance however is used to measure the shortest distance between two points. For instance, the distance between two trees on a plain field.
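Both distances are one line each in numpy. For the example points (1, 2) and (4, 6):

```python
import numpy as np

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])

manhattan = np.abs(p - q).sum()            # |1-4| + |2-6| = 7 ("city block" path)
euclidean = np.sqrt(((p - q) ** 2).sum())  # sqrt(9 + 16) = 5 (straight line)
print(manhattan, euclidean)  # 7.0 5.0
```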
Wrap Up
K-means is a very useful unsupervised approach in Machine Learning, widely used in industry to extract meaningful insights from Big Data. Organizations that hold data on millions of users apply unsupervised algorithms like K-means to cluster their customers into groups, provide them with better services, and use their Big Data efficiently.