Other Topics — Machine Learning Interview Questions
Introduction
Not interested in background on K Nearest Neighbors? Skip to the questions here.
Machine Learning is a lot about research, and with that research come algorithms made for various purposes. One such algorithm that we are going to have a look at today is K Nearest Neighbors. It is yet another algorithm that seems to function very much according to what a human would find intuitive and appeasing. Henceforth, it makes perfect sense for us to move on to questions relating to it as well.
Article Overview
What is the K Nearest Neighbors Algorithm?
K Nearest Neighbors, also known as KNN is a non-parametric supervised learning technique that interestingly can be used to address both classification and regression problem statements.
It models a function to create an output for the unseen data using data that has a target column present (labeled data). For classification or prediction, the distance between the data points is calculated using the Euclidean distance formula.
How does the KNN Algorithm Work?
The KNN classifier or algorithm needs to perform the following for each unknown or test data point:
Step 1: Determine and store the lengths between every point in the training set and the test point
Step 2: Sort the calculated distances in ascending order
Step 3: Store the K nearest points from our training dataset
Step 4: Work out the proportions of each class to the point
Step 5: Assign the class that has the highest proportion to the point in question
K Nearest Neighbors ML Interview Questions/Answers
As we have now managed to learn what the K Nearest Neighbors algorithm is and how it works or the steps involved in it from a basic standpoint. Let us then proceed towards the questions relating to it as well.
Why is KNN a Non Parametric Algorithm?
Making no assumptions about the distribution of the underlying data is what is meant by the term “non-parametric.” The number of parameters in the model for these methods is not fixed.
KNN considers each training example as a parameter, allowing the model parameters to expand or grow with the training data. The KNN algorithm is hence non-parametric.
Does the Need for Feature Scaling Exist with KNN?
Feature scaling is actually necessary for the KNN algorithm to perform better or optimally.
Imagine, for illustration, a dataset with N features and N instances. There is one feature whose values range from 0 to 1. A feature that ranges from -999 to 999 is also present. When these values are used as a substitute in the Euclidean Distance formula, the performance will be impacted since variables with larger magnitudes will be given more weight.
Can the K Nearest Neighbor Algorithm be used for Regression Problems?
KNN may be used to solve regression problems, definitely.
In other words, if the dependent variable is continuous, the KNN algorithm can be used. The predicted value for regression problem statements is determined by averaging the values of the k nearest neighbors.
Why is the KNN Algorithm Considered a “Lazy Learner?”
The KNN method simply saves the training data when it receives it; it does not learn and create a model. Instead of using the training data to discover any discriminative functions, instance-based learning is used, and the training data is also used for making predictions on datasets that have not yet been viewed.
As a result, KNN is referred to as a “Lazy Learner” since it delays learning a model instead of doing so right away.
What is the Time Complexity of the KNN Algorithm?
We can get a rough estimate of the KNN runtime by following this logic:
The distances must be calculated with quadratic time complexity, and the calculated distances are sorted with O(N log N) time. The process is of course an O(N^3 log N) process, which is a colossally lengthy operation, if you really look at it.
This is likely a sufficient demonstration of your understanding in an interview context, however it does not take into account the parameter K, nor any optimizations that can be made such as k-d trees.
See a more complete explanation of the K Nearest Neighbors time complexity here.
What is the Space Complexity of the KNN Algorithm?
Memory is a problem since KNN sorts and stores all pairwise distances in computer memory. Depending on the dataset obviously, the space complexity increases. Usually, if we have really large datasets, local machines will crash.
How can Categorical Variables be handled in the KNN Algorithm?
To handle categorical variables, dummy variables made from the original category variable must be used in place of the original categorical variable. Contrary to regression, make k dummies (k-1).
As an illustration, the categorical variable “Degree” has 5 distinct levels or categories. Therefore, we’ll make 5 dummy variables. For each dummy variable’s degree, 1 is present, and otherwise, 0.
How can the Bias-Variance Tradeoff be Related to the KNN Algorithm?
The main issue with small values of K is that because of how much more influence noise has on the outcome, which also creates a huge variance in predictions, small values of K should be avoided at all costs.
The accuracy increases as the K value increases. Our model is poorly suited if K is too large. As a result, the error will increase once more. Therefore, to avoid under-fitting, your model should maintain its generalization capabilities. Otherwise, there is a potential that it will work well on training data but fall flat on real data. If we select an extremely large value for k, the algorithm’s computational cost likewise rises.
- As k increases, the bias will increase
- As k decreases, the variance will increase
- On must know that with the increasing value of K, the boundary becomes smoother
Henceforth, there is a tradeoff between overfitting and underfitting. So you have to maintain a balance while choosing the value of K in KNN. Therefore, K should not be too small or too large.
Conclusion
To summarize it all, we have certainly learned a lot from the questions above. They also get into the nitty gritties of the KNN algorithm. It is quite remarkable to see how technical it can actually turn out to be, albeit the fact that it is so simple to understand. So there is certainly a high possibility for one to be questioned about such an algorithm in Machine Learning interviews!