Other Topics — Machine Learning Interview Questions
Introduction
The decision trees that we have all come to adore and abide by have their shortcomings. It is natural for an algorithm to suffer from some. However, we as Machine Learning/Artificial Intelligence Engineers or Data Scientists decided not to stop there: we looked those shortcomings right in the eye and eventually conquered them, not by giving up and changing the entire algorithm, but by building on top of it and bringing innovation to it to seal the cracks that caused the issues. In doing so, we discovered the usual suspects, such as random forests, but today we shall discuss GBDTs (Gradient Boosted Decision Trees). The name itself implies an optimization or enhancement of something initial, does it not?
Article Overview
- What is a Gradient Boosted Decision Tree?
- How does a Gradient Boosted Decision Tree work?
- Why do we need Gradient Boosted Decision Trees?
- Gradient Boosted Decision Tree ML Interview Questions & Answers
What is a Gradient Boosted Decision Tree?
A gradient boosted decision tree is yet another algorithm based on ensemble learning. However, it utilizes boosting (obviously) instead of its counterpart, bagging. Therefore, weak learners are converted to strong learners in a GBDT.
So, a GBDT is an ensemble built on the technique of producing an additive predictive model by combining decision trees that, of course, are weak predictors on their own. GBDTs can therefore be used for classification or regression, depending on the type of problem.
Isn’t that interesting?
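As a quick illustration, here is a minimal sketch using scikit-learn's built-in gradient boosting estimators on its bundled toy datasets (the datasets and default settings are just placeholders, not part of the original article):

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Classification: predict a discrete label
Xc, yc = load_breast_cancer(return_X_y=True)
clf = GradientBoostingClassifier().fit(Xc, yc)
print("classification accuracy:", clf.score(Xc, yc))

# Regression: predict a continuous target
Xr, yr = load_diabetes(return_X_y=True)
reg = GradientBoostingRegressor().fit(Xr, yr)
print("regression R^2:", reg.score(Xr, yr))
```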
How does a Gradient Boosted Decision Tree work?
Our beloved GBDTs work through the following steps, and we need to study each of them properly to grasp how they function (a short code sketch follows each list below):
Classification
- Initial Prediction
- Calculate Residuals
- Predict residuals by building a decision tree
- Obtain new probability
- Obtain new residuals
- Repeat steps 3 to 5 until the residuals converge towards 0 or the number of iterations reaches the given hyperparameter (the number of estimators/decision trees)
- Final Computation
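To make these steps concrete, here is a minimal sketch of the classification loop, written for this article rather than taken from any library. It fits each tree to the pseudo-residuals of the log loss and, as a simplification, skips the per-leaf correction that full implementations apply:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, random_state=0)

n_trees, lr = 100, 0.1

# Step 1: the initial prediction is the log-odds of the positive class
p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
f = np.full(len(y), np.log(p / (1 - p)))

trees = []
for _ in range(n_trees):
    prob = 1.0 / (1.0 + np.exp(-f))   # current predicted probabilities
    residuals = y - prob              # Steps 2/5: pseudo-residuals of the log loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # Step 3
    f += lr * tree.predict(X)         # Step 4: update the log-odds
    trees.append(tree)

# Step 7: final computation -- turn the accumulated log-odds into a class label
final_prob = 1.0 / (1.0 + np.exp(-f))
pred = (final_prob >= 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```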
Regression
- Calculate the average of the target label
- Calculate the residuals
- Predict residuals by building a decision tree
- Predict the target label using all the trees within the ensemble
- Compute the new residuals
- Repeat steps 3 to 5 until the residuals converge towards 0 or the number of iterations reaches the given hyperparameter (the number of estimators/decision trees)
- After training is done, use all the trees to make a final prediction as to the value of the target variable
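And a matching minimal sketch for the regression steps (again an illustrative toy implementation, not a library routine):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

n_trees, lr = 100, 0.1

# Step 1: the initial prediction is simply the average of the target
base = y.mean()
f = np.full(len(y), base)

trees = []
for _ in range(n_trees):
    residuals = y - f                                            # Steps 2/5: residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # Step 3
    f += lr * tree.predict(X)                                    # Step 4: update predictions
    trees.append(tree)

def predict(X_new):
    # Final prediction: base value plus the scaled contribution of every tree
    return base + lr * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((predict(X) - y) ** 2))
```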
Why do we need Gradient Boosted Decision Trees?
Well, we harken back to the reasoning behind Random Forests, as the motivation for using GBDTs is pretty similar: getting rid of overfitting. However, GBDTs go a few steps further. GBDTs usually produce more accurate results than other models, and they are excellent with unbalanced data, making their usage a no-brainer!
Gradient Boosted Decision Trees ML Interview Questions/Answers
The machine learning questions on GBDTs are listed below. Try to answer each one in your head before reading the answer.
What elements do GBDTs involve?
Gradient boosting involves three elements (illustrated in code right after this list):
- A loss function to be optimized.
- A weak learner (decision tree) for making predictions.
- An additive model to add weak learners to minimize the loss function.
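A hedged sketch of how these three elements surface as scikit-learn constructor parameters, assuming a recent scikit-learn version (the specific values are purely illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# The three elements expressed as constructor arguments (illustrative values):
model = GradientBoostingRegressor(
    loss="squared_error",   # 1. the loss function being optimized
    max_depth=3,            # 2. the weak learner: a shallow decision tree
    n_estimators=100,       # 3. the additive model: how many trees are summed...
    learning_rate=0.1,      #    ...each scaled by this shrinkage factor
)
```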
How would you define Gradient Boosting?
Gradient Boosting is an algorithm that relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. Here, the key idea is to set the target outcomes for the next model in order to minimize the error.
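In symbols (standard gradient boosting notation, not taken from this article), each stage m adds a new tree h_m fit to the negative gradient of the loss, scaled by a learning rate ν:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x),
\qquad
h_m(x) \approx -\left[\frac{\partial L\bigl(y, F(x)\bigr)}{\partial F(x)}\right]_{F = F_{m-1}}
```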
What are the types of boosting algorithms?
Well, mainly, there are three such algorithms, and they are as follows (a usage sketch comes right after the list):
- AdaBoost (Adaptive Boosting) algorithm
- Gradient Boosting algorithm
- XGBoost (Extreme Gradient Boosting) algorithm
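For reference, all three are available as off-the-shelf estimators. A minimal sketch follows; AdaBoost and gradient boosting ship with scikit-learn, while XGBoost is a separate third-party package, and the settings here are just placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier  # requires `pip install xgboost`

X, y = make_classification(n_samples=1000, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(n_estimators=100),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))
```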
How can we improve Gradient Boosted Decision Trees?
There are various ways to improve the performance of GBDTs, but in general, we can do the following (a sketch of these knobs in code comes after the list):
- Pick a lower learning rate (shrinkage), between 0.1 and 0.3.
- Impose tree constraints on the number of trees, tree depth, minimum improvement in loss, and number of observations per split.
- Lower the learning rate and increase the number of decision trees/estimators proportionally to achieve more robust models.
- Apply penalized learning.
- Implement random sampling.
- Utilize regularization.
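Here is how those knobs map onto a scikit-learn GBDT, purely as an illustrative sketch (the values shown are not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    learning_rate=0.1,          # shrinkage: lower values need more trees
    n_estimators=500,           # increase the trees as the learning rate drops
    max_depth=3,                # tree constraint: keep the learners shallow
    min_samples_split=10,       # tree constraint: observations needed per split
    min_impurity_decrease=0.0,  # tree constraint: minimum improvement in loss
    subsample=0.8,              # random sampling: stochastic gradient boosting
    ccp_alpha=0.0,              # penalized learning via cost-complexity pruning
)
```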
What are the advantages of GBDTs?
The advantages of GBDTs are:
- They provide a degree of predictive accuracy that is seldom matched by other models.
- They offer plenty of flexibility, in the sense that various loss functions can be employed within them and their hyperparameters can be tuned freely, depending on the problem being tackled.
- They require little data pre-processing, so they work well with categorical or numerical data as given.
- They can handle missing data as well.
What are the disadvantages of GBDTs?
The disadvantages of GBDTs are:
- Since they persist in trying to minimize all errors, they tend to overemphasize outliers and can overfit.
- They are quite resource-hungry, as many decision trees, and hence plenty of memory, may be needed.
- They are definitely time-consuming, due to the above and to the high flexibility in parameter tuning, which calls for much more testing.
- They are clearly complex and hard to interpret.
Keep in mind that most disadvantages above have fixes to certain degrees.
What are the differences between GBDTs and Random Forests?
Their differences are as follows (a brief code illustration follows the list):
- Gradient Boosted Decision Trees are more prone to overfitting if given data is noisy.
- Boosting takes longer to train since their decision trees are built sequentially.
- GBDTs are harder to tune.
- GBDTs utilize weak learners (high bias, low variance), meaning they use shallow decision trees.
- Random Forests average their trees, which reduces variance but not bias, so the ensemble is more prone to remaining biased.
- Random Forests do not use a step-by-step approach, and hence do not deal as well with unbalanced datasets.
- Random Forests utilize fully grown decision trees (low bias, high variance).
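A small sketch of how this difference shows up in practice (the settings are illustrative only): random forests grow deep trees independently and average them, while GBDTs grow shallow trees sequentially and sum them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)

# Bagging of fully grown (low-bias, high-variance) trees, built independently
rf = RandomForestRegressor(n_estimators=300, max_depth=None).fit(X, y)

# Boosting of shallow (high-bias, low-variance) trees, built one after another
gbdt = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1).fit(X, y)

print("RF R^2:", rf.score(X, y), "| GBDT R^2:", gbdt.score(X, y))
```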
What are the differences between GBDTs and Adaptive Boosted (AdaBoosted) Decision Trees?
Their differences include the following:
- GBDTs train learners by minimizing a loss function, while AdaBoosted DTs train by concentrating on misclassified observations.
- Weak learners in AdaBoosted DTs are a very basic form of decision trees known as stumps whilst those of GBDTs are deeper with more levels.
- All the learners in GBDTs have equal weights. However, in AdaBoosted DTs, the final prediction is based on a majority vote of the weak learners’ predictions weighted by their individual accuracy.
Do outliers have an effect on Decision Trees?
CART-style models, such as decision trees, are fairly resistant to outliers because they split the feature space with thresholds and do not care how far a point lies from a threshold. In general, the nodes are defined by the sample proportions in each split region (and not by their absolute values).
Do we need feature scaling for Decision Trees?
Tree-based algorithms are largely insensitive to feature scale. The decision tree splits a node on a property that enhances the node’s homogeneity, and this split on a feature is unaffected by other features, resulting in scaling invariance. Decision trees also yield basic classification rules based on if-else statements that may be applied manually if necessary.
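A quick way to convince yourself, as a hedged sketch (tie-breaking could in principle differ, but the two trees should make the same predictions because scaling each feature is a monotonic transform):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds differ in value, but the predictions should agree
print((raw.predict(X) == scaled.predict(X_scaled)).mean())
```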
What are the assumptions of Decision Trees?
- As a CART algorithm, we need to decide which feature we should split on.
- Once we find a feature to split on, we need to decide which value of that feature is best to split at.
- The goal is then to find the features and values that best separate our examples by label.
- Basically, for each feature, we have to figure out the best value to split on.
What is Gini Impurity?
One of the strategies used in decision tree algorithms to determine the ideal split from a root node and subsequent splits is the Gini impurity measure. Gini impurity tells us how likely it is that an observation would be misclassified. The smaller the Gini impurity, the purer the split and the less likely an observation is to be misclassified.
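For a node with class proportions p_i, the Gini impurity is 1 minus the sum of the squared proportions. A tiny helper written purely for illustration:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5 -- the worst case for two classes
print(gini_impurity([1, 1, 1, 1]))  # 0.0 -- a pure node
```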
What is Entropy in Decision Trees?
In layman’s terms, entropy is the degree of disorderliness in a system, and in machine learning it is similar. It is used to measure how the decision tree splits the samples into classes. Technically speaking, entropy is a metric used to evaluate the impurity or uncertainty in a set of data; simply put, it controls how a decision tree splits data. For a two-class problem, entropy (in bits) always lies between 0 and 1, and an entropy greater than about 0.8 is considered high.
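Entropy is the negative sum of p_i * log2(p_i) over the class proportions p_i. A matching helper, again for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a node, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 1, 1]))  # 1.0 -- maximum disorder for two classes
print(entropy([1, 1, 1, 1]))  # 0.0 -- a pure node
```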
What are the merits and demerits of Decision Trees?
(in general, not just GBDTs)
Some advantages and disadvantages of decision trees are listed below:
Advantages
- A decision tree does not need data normalisation or scaling.
- Decision trees necessitate less work for data preparation during pre-processing.
Disadvantages
- A slight change in the data might induce a big change in the structure of the decision tree, resulting in instability.
- Decision tree training can be relatively expensive, as the complexity and time required grow with the data.
Outline the different types of nodes in Decision Trees.
- Root node: the top-most node of the tree, from which the tree grows.
- Decision nodes: one or more nodes that split the data into different segments, with the primary purpose of producing child nodes with the greatest homogeneity or purity.
- Leaf nodes: the terminal nodes, which reflect the data segments with the greatest uniformity.
Wrap Up
Alright then, yet another excellent approach to resolving everyday Machine Learning problems. The degree of control that GBDTs offer, the way almost all of their disadvantages can be managed with various maneuvers, the way they deal with unbalanced data, and, not to forget, their excellent accuracy, often ahead of most other models. All of these make their usage not only worthwhile but a definite necessity in many real-world problems. Don’t you think?