The Most Important Things You Need To Know About Agglomerative Clustering


Introduction


Artificial Intelligence (AI) can often seem cumbersome and over the top. However, things are not always that complex: algorithms that do not look hard can still prove to be the answer in many circumstances. One such algorithm is HAC – Hierarchical Agglomerative Clustering. Since its name combines the familiar word 'clustering' with less familiar qualifiers, it is well worth a closer look.


What is Hierarchical Clustering?

Hierarchical Clustering is an unsupervised ML (Machine Learning) algorithm that groups unlabeled data points into clusters. It is also known as Hierarchical Cluster Analysis (HCA).

As its name suggests, this algorithm builds a hierarchy of clusters in the form of a tree. This tree-shaped structure is known as a dendrogram.

What is Agglomerative Clustering?

Agglomerative Clustering, also known as the bottom-up approach or Hierarchical Agglomerative Clustering (HAC), produces a structure that is more informative than the unstructured set of clusters returned by flat clustering. A big plus of this algorithm is that it does not require us to pre-specify the number of clusters.

Bottom-up algorithms treat each data point as a singleton cluster at the outset, then successively merge pairs of clusters until everything has been merged into a single cluster containing all the data.
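As a minimal sketch (assuming scikit-learn is installed; the data points here are made up for illustration), agglomerative clustering on a small 2-D dataset looks like this:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (toy data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Bottom-up merging; here we cut the hierarchy at 2 clusters.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # one cluster label per point
```

The first three points receive one label and the last three the other, since the merging stops once the two distant groups would have to be joined.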

Agglomerative Clustering ML Interview Questions/Answers

Now that we have a good idea about what Hierarchical Clustering is and more specifically, what Agglomerative Clustering is, we can move onto interview questions regarding it.

What are the two types of Hierarchical Clustering?

The two types of Hierarchical Clustering techniques are as follows:

  • The agglomerative or bottom-up approach, wherein the algorithm starts by taking all data points as single clusters and merging them until one cluster is left.
  • The divisive or top-down approach, which is simply the opposite of the agglomerative approach. Here we start with all the data points in a single cluster and, at each iteration, split off the points that are least similar. Each separated point becomes an individual cluster, so at the end we are left with n clusters. This technique is rarely used in real-world applications.

What are the steps of the Agglomerative Clustering Algorithm?

The steps are as follows:

  • First compute the proximity matrix.
  • Then let each data point be a cluster.
  • Combine the two closest clusters and update the proximity matrix accordingly. Repeat this step until only a single cluster remains.
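The steps above can be sketched in plain NumPy. This is a toy single-linkage implementation for illustration, not a production algorithm (real libraries update the proximity matrix incrementally instead of recomputing distances):

```python
import numpy as np

def agglomerative(points, n_clusters=1):
    """Toy bottom-up clustering with single-linkage merges."""
    points = np.asarray(points, dtype=float)
    # Step 1: each data point starts as its own singleton cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        # Single linkage: minimum pairwise distance between the two clusters.
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    # Step 2: repeatedly merge the two closest clusters.
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], n_clusters=2))
# → [[0, 1], [2, 3]]
```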

What is a dendrogram in Hierarchical Clustering Algorithm?

A dendrogram is a tree-like structure that records each merge step the Hierarchical Clustering Algorithm performs.

When a dendrogram is plotted, the X-axis lists all the observations in the given dataset, while the Y-axis shows the distance (typically Euclidean) at which clusters are merged.
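A hedged sketch with scipy (the points are made-up toy data): `linkage` builds the merge hierarchy, and `dendrogram` lays out the tree. With matplotlib installed, `dendrogram(Z)` draws it; `no_plot=True` just returns the layout so we can inspect it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Each row of Z records one merge: the two merged clusters,
# the distance at which they merged, and the new cluster's size.
Z = linkage(X, method="single")

# The merge distances in Z become the Y-axis heights of the dendrogram.
info = dendrogram(Z, no_plot=True)
print(info["ivl"])  # leaf (observation) order along the X-axis
```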

Give two conditions that produce two different dendrograms using an Agglomerative Clustering Algorithm with the same dataset.

The above situation could occur due to either of the following reasons:

  • Either a change in the proximity (linkage) function
  • Or a change in the number of data points or variables.

Either change will lead to different clustering results and hence to different dendrograms.

What are the advantages of Hierarchical Clustering?

Some advantages of HCA are:

  • The number of clusters does not have to be specified in advance; it can be chosen afterwards by cutting the dendrogram at a suitable height.
  • Dendrograms give a clear visualization of the clustering process that is both practical and easy to understand.

What are the disadvantages of Hierarchical Clustering?

Some disadvantages of HCA are:

  • HCA is not suitable for large datasets, due to its high time and space complexity (naively O(n³) time and O(n²) memory).
  • Hierarchical Clustering does not optimize a single global mathematical objective.
  • All approaches that calculate the similarity between clusters have their own disadvantages.

What are the different linkage methods used in the Hierarchical Clustering Algorithm?

There are many popular linkage methods used in Hierarchical Clustering. Some of them are as follows:

Single-linkage

In this method, the distance between two clusters is defined as the minimum distance between any two data points, one from each cluster.

Average-linkage 

In this method, the distance between two clusters is defined as the average of the distances between every data point in one cluster and every data point in the other.

Centroid-linkage 

In this method, we find the centroid of cluster 1 and the centroid of cluster 2 and then calculate the distance between the two before merging.

Complete-linkage

In this method, the distance between two clusters is defined as the maximum distance between any two data points, one from each cluster.
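The four definitions can be compared directly with scipy's `linkage` function. In this sketch, four points on a line form two pairs, `{0, 1}` and `{4, 5}`; the final merge distance between the two pairs differs by method exactly as the definitions predict (centroid linkage assumes Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])

for method in ("single", "average", "centroid", "complete"):
    Z = linkage(X, method=method)
    # Z[-1, 2] is the distance at which the last two clusters merge.
    print(f"{method:9s} final merge distance = {Z[-1, 2]:.2f}")
# single:   min pairwise gap  |1-4| = 3.00
# average:  (4+5+3+4)/4       = 4.00
# centroid: |0.5-4.5|         = 4.00
# complete: max pairwise gap  |0-5| = 5.00
```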

Give the pros and cons of complete and single linkages methods in the Hierarchical Clustering Algorithm.

Pros of Single-linkage

This approach can handle non-elliptical cluster shapes, as long as the gap between the two clusters is not too small.

Cons of Single-linkage

This approach cannot separate clusters properly if there is noise between clusters.

Pros of Complete-linkage

This approach gives well-separated clusters even when there is some noise between them.

Cons of Complete-Linkage

This approach is biased towards globular clusters.
It tends to break large clusters.

Wrap Up

This article shows the importance of Hierarchical Clustering Algorithms, and more specifically of the Agglomerative Clustering Algorithm, which is one member of that family. Rest assured, questions about it should be expected in an interview, and it is clearly something one would benefit from knowing. The article also notes why the Divisive Clustering Algorithm is rarely used in practice, and gives some pointers on why the hierarchical approach can be preferable to the usual flat clustering approaches.

Avi Arora

Avi is a Computer Science student at the Georgia Institute of Technology pursuing a Masters in Machine Learning. He is a software engineer working at Capital One, and the Co Founder of the company Octtone. His company creates software products in the Health & Wellness space.