Other Topics — Machine Learning Interview Questions
Introduction
One of the biggest obstacles holding Machine Learning models back today is the presence of errors in data. As the field has evolved, so have the models: older models have been optimized, or replaced outright by newer ones. A key challenge in making models accurate is giving them the ability to detect anomalies.
Article Overview
- What is Anomaly Detection?
- Why is Anomaly Detection Important?
- Anomaly Detection ML Interview Q&A
- Wrap Up
What Does Anomaly Detection Mean?
Anomaly detection means being able to recognize outliers in a dataset. Outliers, or anomalies, are data points that do not follow the general trend of the rest of the dataset. In ML, anomaly detection refers to a model's ability to distinguish these data points from the others, so that it is neither trained on them nor confused by them in real-world scenarios.
Why is Anomaly Detection Important?
Anomaly detection is important because it helps remove outliers from a dataset. ML models must be able to recognize outliers so that they are not trained on them, since outliers can erroneously skew a model's results. Decisions based on such a model could then rest on poor data analysis, damaging a company's sales or a robot's operation, for example. In short, unhandled anomalies can be a make-or-break factor in the lifecycle of a business, or even hurt someone.
Anomaly Detection ML Interview Questions/Answers
Now that we know what anomaly detection is and why it matters, let us look at interview questions related to it; questions on this topic are very likely to come up. Try to answer each one in your head before reading the answer.
In a uniform distribution, the mean and standard deviation merely characterize the range of values. A possible indication of anomalous behavior is that a small neighborhood contains substantially fewer or more data points than a uniform distribution would predict.
A normal distribution follows the empirical rule: roughly 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean, respectively. Hence, a threshold (such as three standard deviations) is chosen, and points farther than that from the mean are declared anomalous.
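The empirical-rule threshold above can be sketched with z-scores; this is a minimal illustration on synthetic data (the dataset and the two planted outliers are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly normal data, with two obvious outliers appended for illustration.
data = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [120.0, -40.0]])

# z-score: distance from the mean in units of standard deviation.
z_scores = (data - data.mean()) / data.std()

# Declare points beyond three standard deviations anomalous.
anomalies = data[np.abs(z_scores) > 3]
print(anomalies)
```

Note that the mean and standard deviation are themselves inflated by the outliers; robust variants replace them with the median and median absolute deviation.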
They can be placed in the following types:
1) Point anomalies: which are also known as global anomalies and refer to a single instance of data being anomalous if it’s too far off from the others.
Use case: for detecting credit card fraud based on “amount spent.”
2) Contextual anomalies: which are also known as conditional anomalies and consist of an abnormality that is context-specific. Such anomalies occur commonly in time-series data.
Use case: spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
3) Collective anomalies: refers to the types of anomalies that exist as a set of data points that are anomalous to the entire dataset.
Use case: when someone copies data from a remote machine to a local host in an unexpected fashion, the pattern would be flagged as a potential cyber attack.
The three types of outlier detection are:
1) Supervised: which requires completely labeled training and testing datasets. An ordinary classifier is trained first and applied afterward.
2) Semi-supervised: this utilizes both training and test datasets, where training data only consists of normal data without any outliers. A model of the normal class is learned, and outliers can then be detected if they deviate from that model.
3) Unsupervised: which simply does not require any labels, and there is no distinction between training and test datasets here. Data is scored solely based on the intrinsic properties of the dataset.
The three approaches to detect anomalies are:
1) By Density – Normal points occur in dense regions, while anomalies occur in sparse regions
2) By Distance – Normal point is close to its neighbors, and the anomaly is far from its neighbors
3) By Isolation – The term isolation means ‘separating an instance from the rest of the instances.’ Anomalies are ‘few and different’ and are therefore more susceptible to isolation.
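The isolation approach is what scikit-learn's Isolation Forest implements; here is a hedged sketch on synthetic data (the cluster, the planted outliers, and the contamination value are assumptions chosen for the example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# A dense cluster of normal points plus a few far-away anomalies.
normal = rng.normal(loc=0, scale=1, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([normal, outliers])

# Isolation Forest scores points by how few random splits isolate them:
# anomalies are isolated quickly, so they get short average path lengths.
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print(X[labels == -1])
```

The `contamination` parameter is the expected fraction of anomalies and sets the decision threshold; in practice it must be estimated or tuned.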
No, it cannot, since it is not built for that purpose. k-means will end up giving a solution that minimizes the total within-cluster sum of squares, and the outliers will not necessarily define their own cluster.
Normalization is a process that rescales the values into a range of 0 to 1. Because the minimum and maximum define that range, a single extreme outlier squashes the rest of the data into a narrow band, so information distinguishing the normal points is effectively lost.
Standardization is a process that rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance). This, therefore, retains the outliers and is recommended for most applications.
One of the best algorithms for this use case is the Support Vector Machine. It has a shorter training time and better accuracy than many alternatives, though arguments can be made for other algorithms, especially under different constraints.
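For anomaly detection specifically, the SVM is usually used in its one-class form; here is a hedged sketch with scikit-learn's `OneClassSVM`, trained only on normal data in the semi-supervised style described earlier (the synthetic data and the `nu` value are assumptions for the example):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
# Train only on normal data: the model learns the boundary of that class.
X_train = rng.normal(loc=0, scale=1, size=(300, 2))

# nu upper-bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# Score unseen points: 1 = normal, -1 = anomaly.
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
print(model.predict(X_test))
```

A point near the training cluster is accepted as normal, while a point far outside the learned boundary is rejected.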
Some other algorithms that can be used for anomaly detection are:
1) Neural Networks
2) K nearest neighbor
3) Local Outlier Factor
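The Local Outlier Factor from the list above is a density-based method; a minimal sketch with scikit-learn's `LocalOutlierFactor` (the cluster and the single isolated point are invented for the example):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# A tight cluster plus one isolated point far away.
cluster = rng.normal(loc=0, scale=0.5, size=(100, 2))
X = np.vstack([cluster, [[5.0, 5.0]]])

# LOF compares each point's local density with that of its neighbors;
# a point much less dense than its neighbors scores as an outlier.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = anomaly, 1 = normal
print(labels[-1])  # label of the isolated point
```

Unlike a global threshold on distance from the mean, LOF can flag a point that is anomalous only relative to its local neighborhood, which suits datasets with clusters of varying density.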
Wrap Up
Anomaly detection is clearly a vital part of Machine Learning, playing a role in the training and construction of every model and algorithm. ML models would not be successful if they had not found ways to handle anomalies, and their accuracy would never have reached a safe enough level for any application or prediction.