A Gentle Introduction to Probability for Machine Learning

Introduction

Uncertainty is common to everything we do. When we throw a die, we are uncertain which number will come up. When we toss a coin, we are uncertain whether we will get a head or a tail. The same applies in machine learning. Machine learning algorithms often work in non-deterministic environments: you cannot always be confident that a model will return the same output when fed the same input data. This is because uncertainty enters at many points: noisy or missing input data, the many hyperparameters to be tuned, the complexity of the algorithm's environment, and the inherently probabilistic nature of the data itself.

If you want to learn machine learning, probability is an important concept to understand. Although the reason may not be obvious when you are getting started, a deep dive into the inner workings of many algorithms requires a solid understanding of probability. For instance, the Naïve Bayes algorithm, a popular and powerful machine learning algorithm, is based directly on probability.

In this article, we will discuss what probability is about. From a mathematical standpoint, we will touch on several concepts and theorems in probability. We'll also solve some practical problems to help solidify your understanding.

Also check out my article on Linear Algebra For Machine Learning


What exactly is Probability?

Probability is simply the likelihood of an event (or series of events) occurring. Suppose, for instance, we wish to determine the probability of getting a Tail when we toss an unbiased coin. Since there are only two equally likely outcomes (Head or Tail), the probability of one of them (in our case, the Tail) is simply ½. The probability of an event is always a number between 0 and 1: it cannot be less than 0 or greater than 1. (Probability densities, which we will meet shortly, can exceed 1 at a point, but probabilities themselves cannot.)
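Here is a minimal Python sketch (the setup is my own, not from the article) that estimates P(Tail) by simulating many fair coin tosses; the empirical frequency should approach ½:

```python
import random

# Estimate P(Tail) by simulating a large number of fair coin tosses.
n_tosses = 100_000
tails = sum(random.choice(["Head", "Tail"]) == "Tail" for _ in range(n_tosses))

print(f"Estimated P(Tail) = {tails / n_tosses:.4f}")  # close to 0.5
```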

Mathematically, the probability that a random variable X takes the value x is

p(x) = P(X = x) \geq 0

More formally, probability is defined by the following axioms (the Kolmogorov axioms):

  1. P(A) \geq 0 for every event A.
  2. P(S) = 1, where S is the sample space.
  3. For mutually exclusive events A_{1}, A_{2}, \ldots, we have P(A_{1} \cup A_{2} \cup \ldots) = P(A_{1}) + P(A_{2}) + \ldots

Before we go any further, let’s define some important terms that you’d come across in probability. 

  • Sample space (S): A sample space is simply the set of all possible outcomes in an experiment, whether conceptual or physical. 

The sample space can be either finite or infinite. An example of a finite sample space is the throw of a die: since the outcome can only be an integer from 1 to 6, the sample space is {1, 2, 3, 4, 5, 6}, which is finite. On the other hand, an example of an infinite sample space is the temperature of a room: it can be any real number in a continuous range.

  • Event (A): An event is any subset of the sample space. In the throw of a die, for instance, seeing a 1 is an event, as is seeing a 2, 3, 4, 5, or 6.

Further Resources:
What is Probability? – Probability Course

Types of Probability distributions

There are two types of probability distribution: continuous distributions and discrete distributions.

Continuous distribution

A distribution over a continuous variable is called a continuous probability distribution, and it is described by a probability density function. The value of this function at a point is called the likelihood or density value. An example of a continuous variable is temperature.


For a probability density function p(x),

\int_{-\infty}^{\infty} p(x) \;dx = 1

That is, the total area under the curve of a probability density function is 1.

Continuous variables can follow several different distributions (a short numerical check of the two below follows the list). Examples include:

  • Uniform Density Function: Here, every value in the interval [a, b] is equally likely, so any two subregions of equal width carry the same probability.
    Mathematically, f(x) = \frac{1}{b-a} \;\; for \;\; a \leq x \leq b

    Otherwise, f(x) = 0
  • Gaussian or normal density function: This is the most common continuous distribution. It is fully specified by two parameters: the standard deviation, σ, and the mean, μ.

    Mathematically, f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
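To make both densities concrete, here is a minimal numpy sketch (the helper names are my own) that evaluates each one and checks numerically that it integrates to approximately 1:

```python
import numpy as np

def uniform_pdf(x, a, b):
    # f(x) = 1/(b - a) for a <= x <= b, and 0 otherwise
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

def gaussian_pdf(x, mu, sigma):
    # f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-10, 10, 100_001)
print(np.trapz(uniform_pdf(x, a=0, b=2), x))        # ~1.0
print(np.trapz(gaussian_pdf(x, mu=0, sigma=1), x))  # ~1.0
```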

Discrete distribution

This is also known as a discrete probability distribution or probability mass function. The value at each point is called the probability value. An example of a discrete variable is the outcome of a die throw: it can only be an integer from 1 to 6.

For a discrete probability function p(x),

\sum p(x) = 1

This means that the probabilities of all possible outcomes sum to 1.

There are different types of discrete distributions. Examples are:

  1. Bernoulli Distribution: The Bernoulli distribution models a single trial with exactly two outcomes.
    Its probability mass function is given by p for x = 1, and 1 - p for x = 0.
  2. Binomial Distribution: The binomial distribution models the number of successes when a Bernoulli trial is repeated multiple times. For instance, finding the probability of getting a tail exactly 4 times if a coin is flipped 10 times (a sketch of this computation follows the list).

    Mathematically, P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k}
    Where k is the number of successes and n - k is the number of failures.
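Here is a small sketch of the binomial formula in code, using only Python's standard library, for the coin example above (4 tails in 10 flips with p = 0.5):

```python
from math import comb

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
n, k, p = 10, 4, 0.5
prob = comb(n, k) * p**k * (1 - p) ** (n - k)

print(f"P(X = 4) = {prob:.4f}")  # 210/1024, roughly 0.2051
```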

Let’s take an example.

Sample Problems with Discrete Distributions

X is the random variable for throwing a die and Y is the random variable for flipping a coin; both events are performed together for 100 repetitions. The table below shows the joint counts for throwing a die and flipping a coin.

Throwing a die (X) vs. flipping a coin (Y):

|          | X = 1 | X = 2 | X = 3 | X = 4 | X = 5 | X = 6 | Cj  |
|----------|-------|-------|-------|-------|-------|-------|-----|
| Y = Head | 7     | 12    | 5     | 9     | 3     | 11    | 47  |
| Y = Tail | 3     | 5     | 16    | 8     | 7     | 14    | 53  |
| Ci       | 10    | 17    | 21    | 17    | 10    | 25    | 100 |
  1. What is the probability of X = 2 and Y = Tail?

    Solution
    The joint probability is the number of times both outcomes occurred together, divided by the total number of repetitions. It is given by:

    P(X = x,\: Y=y) = N_{ij}/N

    P(X = 2,\: Y=Tail) = 5/100
  2. What is the probability of X = 6?

    Solution
    Here we do not care whether the coin came up heads or tails, so we sum all the occurrences where the die returned 6, which is the column total. It is given by:

    P(X = x) = C_{i}/N

    P(X = 6) = 25/100
  3. Find the probability of X = 3 given that Y = Tail.

    Solution
    Here, since we are given that Y = Tail, we restrict attention to the trials where Y = Tail, which occurred 53 times in total. Among those trials, X = 3 occurred 16 times. Thus the probability of X = 3 given that Y = Tail is 16/53.

    To recall: for a single event X, the probability of x is given by C_{i}/N. For two events X and Y, the joint probability of X and Y is given by N_{ij}/N, and the conditional probability of X given Y = y is N_{ij}/C_{j}. The sketch below reproduces all three answers.
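Here is a minimal numpy sketch (the array layout is my own choice) that reproduces all three answers from the joint table:

```python
import numpy as np

# Joint counts from the table: rows are Y (Head, Tail), columns are X = 1..6.
counts = np.array([
    [7, 12, 5, 9, 3, 11],   # Y = Head
    [3, 5, 16, 8, 7, 14],   # Y = Tail
])
N = counts.sum()  # 100 repetitions

# Joint probability: P(X = 2, Y = Tail) = N_ij / N
print(counts[1, 1] / N)                 # 0.05

# Marginal probability: P(X = 6) = C_i / N
print(counts[:, 5].sum() / N)           # 0.25

# Conditional probability: P(X = 3 | Y = Tail) = N_ij / C_j
print(counts[1, 2] / counts[1].sum())   # 16/53, roughly 0.302
```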

Sum and Product Rule

There are two key rules that underpin many other results involving joint probabilities (both are verified numerically in the sketch after this list).

  1. Sum Rule: p(X) = \sum_{Y} p(X,Y)
  2. Product Rule: p(X,Y) = p(Y|X)\,p(X)
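Both rules can be checked numerically against the joint table from the previous section; here is a minimal numpy sketch doing so:

```python
import numpy as np

# p(Y, X) estimated from the joint counts above (rows: Head, Tail).
joint = np.array([
    [7, 12, 5, 9, 3, 11],
    [3, 5, 16, 8, 7, 14],
]) / 100.0

# Sum rule: p(X) is obtained by summing the joint over Y.
p_X = joint.sum(axis=0)
print(p_X)                                    # [0.1 0.17 0.21 0.17 0.1 0.25]

# Product rule: recover the joint as p(Y | X) * p(X), column by column.
p_Y_given_X = joint / p_X                     # divide each column by p(X)
print(np.allclose(p_Y_given_X * p_X, joint))  # True
```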

Further Resources:
Probability Theory: Bayes Theorem, Sum Rule and Product Rule

Conditional Independence

Conditional independence is a situation where, once one piece of information is known, an additional observation provides no further information about a hypothesis. Let's put this in simple terms: if the probability of event A given events B and C is the same as the probability of event A given event C alone, then events A and B are conditionally independent given C. For example, a child's height and vocabulary size are dependent in general, but they are conditionally independent given the child's age.

Mathematically,

P(A|B, C) = P(A|C)

In cases where there are more than two events, the product rule can be combined with conditional independence assumptions to factorize the joint distribution into simpler terms.
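Here is a minimal numpy sketch (with made-up conditional probability tables for binary events) that builds a joint distribution in which A and B are conditionally independent given C, then verifies P(A|B, C) = P(A|C):

```python
import numpy as np

p_C = np.array([0.3, 0.7])                        # p(C)
p_A_given_C = np.array([[0.9, 0.2], [0.1, 0.8]])  # p(A|C): rows = A, cols = C
p_B_given_C = np.array([[0.6, 0.4], [0.4, 0.6]])  # p(B|C): rows = B, cols = C

# joint[a, b, c] = p(A=a | C=c) * p(B=b | C=c) * p(C=c)
joint = np.einsum('ac,bc,c->abc', p_A_given_C, p_B_given_C, p_C)

# P(A | B, C) = p(A, B, C) / p(B, C)
p_A_given_BC = joint / joint.sum(axis=0, keepdims=True)

# P(A | B=b, C) should equal P(A | C) for every value b of B.
for b in range(2):
    print(np.allclose(p_A_given_BC[:, b, :], p_A_given_C))  # True, True
```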

Sample Problem with Conditional Independence

Here is an example of how conditional independence can be applied, from ProbabilityCourse

Further Resources:
Conditional Independence — The Backbone of Bayesian Networks

Bayes Rule

Bayes' rule provides a way to reverse the direction of a conditional probability using the product rule. In other words, if you know the probability of B given A and the probabilities of the individual events, you can determine the probability of A given B.

By definition, the probability of A given B is given by:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

How, you may ask? With the product rule, we can derive Bayes' rule ourselves: the joint probability can be factored in two ways, P(A,B) = P(A|B)P(B) = P(B|A)P(A). Dividing both sides by P(B) gives the formula above.

This is Bayes' rule, and it is what is used to develop the Naive Bayes algorithm.
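As an illustration, here is a minimal Python sketch (with made-up numbers for a test of a rare condition) that applies Bayes' rule to compute a posterior probability:

```python
# Prior and likelihoods (illustrative values only).
p_A = 0.01                # P(A): prior probability of the condition
p_B_given_A = 0.95        # P(B|A): positive test given the condition
p_B_given_not_A = 0.05    # P(B|not A): false positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.4f}")  # roughly 0.1610
```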

Mean and Variance 

  • The mean, also called the expectation, is defined as the weighted average of the values of a probability distribution, with each value weighted by its probability. For a uniform distribution such as a fair die, this weighted average coincides with the simple average of the outcomes. For a biased coin, however, the weighted average and the simple average differ, since the expectation acts like a centre of mass and shifts toward the more probable outcome.

    Mathematically, the expectation is given by:

    E[g(x)] = \int g(x)\,p(x)\,dx

    Where g(x) is a function of the outcome x and p(x) is the probability density of the event.
  • Variance, on the other hand, measures how much the value of an event varies around its mean when values are drawn from the probability distribution. Variance is defined in terms of the expectation and is given by:

    Variance = E[(g(x) - E[g(x)])^2]
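For a concrete example, here is a minimal numpy sketch computing the expectation and variance of a fair six-sided die directly from these definitions:

```python
import numpy as np

x = np.arange(1, 7)      # outcomes 1..6
p = np.full(6, 1 / 6)    # each outcome equally likely

mean = np.sum(x * p)                    # E[X] = sum of x * p(x)
variance = np.sum((x - mean) ** 2 * p)  # E[(X - E[X])^2]
print(mean, variance)                   # 3.5, roughly 2.9167
```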

Understanding Gaussian Distributions

The Gaussian distribution is the most commonly used distribution curve because it models many real-life datasets well. Conveniently, a Gaussian distribution is fully specified by just two parameters: the mean and the variance.

A Gaussian distribution is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp( -\frac{1}{2}(\frac{x-\mu}{\sigma})^2)

Where \mu is the mean and \sigma^2 is the variance.

Visually, a Gaussian distribution is the familiar symmetric bell curve, centred at the mean μ with a width controlled by σ.

Properties of Gaussian Distributions

  • A linear transform of a Gaussian remains a Gaussian (see the sketch after this list).
    For the mean: E(AX + b) = AE(X) + b
    For the covariance: Cov(AX + b) = A\,Cov(X)\,A^{T}
  • The sum of two independent Gaussian variables is also Gaussian.
  • The product of two Gaussian density functions is proportional to another Gaussian density.
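The first two properties are easy to check empirically; here is a minimal numpy sketch (the parameter values are arbitrary) that does so by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)   # X ~ N(2, 9)
y = rng.normal(loc=-1.0, scale=4.0, size=1_000_000)  # Y ~ N(-1, 16)

# Linear transform: 5X + 1 should be N(5*2 + 1, 25*9) = N(11, 225).
z = 5 * x + 1
print(z.mean(), z.var())   # ~11, ~225

# Sum of independent Gaussians: X + Y should be N(2 - 1, 9 + 16) = N(1, 25).
s = x + y
print(s.mean(), s.var())   # ~1, ~25
```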

Multivariate Gaussian Distributions

If you have more than one feature, the distribution becomes a multivariate Gaussian distribution. For a d-dimensional vector x, the probability density becomes:

f(x) = \frac{1}{(2\pi)^{d/2}\left|\Sigma\right|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)

Where \mu is the mean vector and \Sigma is the covariance matrix.

Multivariate Gaussian distribution has a lot of applications such as in Kalman filters, Mixtures of Gaussians (MoG), Probabilistic principal components analysis (PPCA), and many more.

Central Limit Theorem

The Central Limit Theorem states that when independent random variables drawn from a distribution are summed (or averaged), the resulting distribution tends toward a normal distribution as the sample size grows larger, even if the original distribution is not normally distributed.

This forms a core part of probability and statistics, as it helps us understand how sample size affects estimation error. It can also be used to judge whether the sample size chosen for a survey is adequate.
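A simulation makes the theorem tangible; here is a minimal numpy sketch averaging die rolls for growing sample sizes. The averages concentrate around 3.5, and their spread shrinks like σ/√n:

```python
import numpy as np

rng = np.random.default_rng(42)

# For each sample size n, draw 10,000 samples of n die rolls and average them.
for n in (1, 2, 10, 100):
    means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}: mean={means.mean():.3f}, std={means.std():.3f}")
```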

Wrap Up

In this article, you have learned the key theoretical concepts of probability and how they apply in machine learning. As a data scientist, it is important to understand these concepts, as they are critical to the daily decisions you make during statistical evaluation and inference.

Avi Arora

Avi is a Computer Science student at the Georgia Institute of Technology pursuing a Masters in Machine Learning. He is a software engineer at Capital One and the co-founder of Octtone, a company that creates software products in the Health & Wellness space.