Probability Basics
Abstract:
Probability distributions are fundamental in data science for purposes like building machine learning models, modeling uncertainty, and guiding informed decision-making. They define how the values of a set of random variables are distributed. Broadly, these distributions are classified into discrete and continuous types. Below is an overview of key probability distributions, along with their characteristics and common applications.
Discrete Probability Distributions
Discrete distributions apply to scenarios where the set of possible outcomes is countable. Key discrete distributions include:
1. Bernoulli Distribution:
The Bernoulli distribution models a single trial with exactly two possible outcomes:
- Success with probability p
- Failure with probability 1 – p
It is the simplest type of probability distribution and serves as the foundation for more complex distributions like the binomial distribution. In machine learning, it is closely related to logistic regression.
Probability mass function:
\[ P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\} \]
Example 1: Coin Toss
A fair coin toss can be modeled using a Bernoulli distribution with p = 0.5, where:
- Success = Heads
- Failure = Tails
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
# Define the probability of success (e.g., heads)
p = 0.5
x = [0, 1] # 0 = failure, 1 = success
# Get probability mass function values
pmf = bernoulli.pmf(x, p)
# Plot
plt.figure()
plt.bar(x, pmf, tick_label=["Failure (0)", "Success (1)"])
plt.title(f'Bernoulli Distribution (p = {p})')
plt.ylabel('Probability')
plt.xlabel('Outcome')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Ad Click
A website visitor has a 30% chance of clicking on an ad. What is the probability that a randomly selected visitor clicks the ad?
This is a Bernoulli trial:
- Only one trial (a single visitor)
- Two possible outcomes: click (success = 1) or no click (failure = 0)
- Probability of success \( p = 0.3 \)
from scipy.stats import bernoulli
# Probability of success
p = 0.3
# Bernoulli distribution
click_prob = bernoulli.pmf(1, p) # P(click)
no_click_prob = bernoulli.pmf(0, p) # P(no click)
print(f"Probability the visitor clicks the ad: {click_prob:.2f}")
print(f"Probability the visitor does NOT click the ad: {no_click_prob:.2f}")
2. Binomial Distribution:
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success p.
- n = number of trials
- p = probability of success in each trial
- k = number of successes (0 to n)
Probability mass function:
\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]
Example 1: 10 Coin Flips
If you flip a fair coin 10 times (n = 10, p = 0.5), the binomial distribution models the probability of getting k heads, where k can range from 0 to 10.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Parameters
n = 10 # number of trials
p = 0.5 # probability of success
x = np.arange(0, n+1)
# Probability mass function
pmf = binom.pmf(x, n, p)
# Plot
plt.figure()
plt.bar(x, pmf, color='skyblue')
plt.title(f'Binomial Distribution (n = {n}, p = {p})')
plt.xlabel('Number of Successes (Heads)')
plt.ylabel('Probability')
plt.xticks(x)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Free Throws
A basketball player has a free-throw success rate of 80%. During a game, she takes 10 free throws. What is the probability that she makes exactly 8 of them?
This is a classic binomial probability problem where:
- Each free throw is an independent trial
- The probability of success \( p = 0.8 \)
- The number of trials \( n = 10 \)
- We are interested in exactly \( k = 8 \) successes
from scipy.stats import binom
# Parameters
n = 10 # number of trials
p = 0.8 # probability of success
k = 8 # desired number of successes
# Binomial probability
probability = binom.pmf(k, n, p)
print(f"Probability of exactly {k} successful free throws out of {n}: {probability:.4f}")
3. Poisson Distribution:
The Poisson distribution models the number of events occurring in a fixed interval of time or space, under the following assumptions:
- Events occur independently of one another
- The average rate of occurrence (λ) is constant
- Two events cannot occur at exactly the same instant
It’s useful for modeling count-based events over time, such as system failures, web traffic, or call arrivals.
Probability mass function:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots \]
Example: Call Center
Suppose a call center receives on average 4 calls per hour (λ = 4). The Poisson distribution can model the probability of receiving 0, 1, 2, ..., k calls in any given hour.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
# Parameters
mu = 4 # average number of calls per hour (scipy's poisson names the rate parameter mu)
x = np.arange(0, 15)
# Probability mass function
pmf = poisson.pmf(x, mu)
# Plot
plt.figure()
plt.bar(x, pmf, color='salmon')
plt.title(f'Poisson Distribution (λ = {mu})')
plt.xlabel('Number of Calls per Hour')
plt.ylabel('Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
4. Discrete Uniform Distribution:
The discrete uniform distribution assigns equal probability to all outcomes in a finite set. It is used when there is no prior reason to favor one outcome over another: each outcome is equally likely.
For a set of \( n \) possible outcomes:
\[ P(X = x_i) = \frac{1}{n} \quad \text{for all } x_i \text{ in the set} \]
Example: Fair Die Roll
Rolling a fair six-sided die is a classic example. Each face (1 through 6) has a probability of \( \frac{1}{6} \).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
# Parameters
low, high = 1, 7 # randint is [low, high), so high = 7 for die faces 1 to 6
x = np.arange(low, high)
# Probability mass function
pmf = randint.pmf(x, low, high)
# Plot
plt.figure()
plt.bar(x, pmf, color='lightgreen')
plt.title('Discrete Uniform Distribution (Fair 6-Sided Die)')
plt.xlabel('Die Face')
plt.ylabel('Probability')
plt.xticks(x)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Continuous Probability Distributions
A continuous probability distribution describes the probabilities of a continuous random variable, which can take any real value in a given range.
Key Features:
- It is described by a Probability Density Function (PDF) \( f(x) \).
- Probabilities are calculated as areas under the curve of the PDF: \[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]
- The probability of a single exact value is zero: \[ P(X = x) = 0 \]
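As a quick numerical check (a minimal sketch using the standard normal from scipy.stats), interval probabilities come from the CDF, while the density at a point is not a probability:
import numpy as np
from scipy.stats import norm
# P(-1 <= X <= 1) for a standard normal: the area under the PDF,
# computed as CDF(1) - CDF(-1)
interval_prob = norm.cdf(1) - norm.cdf(-1)
print(f"P(-1 <= X <= 1) = {interval_prob:.4f}")  # ~0.6827
# P(X = 0) is exactly zero; norm.pdf(0) is a density value, not a probability
print(f"Density at x = 0: {norm.pdf(0):.4f}")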
🔁 Key Differences Between Discrete and Continuous Distributions
Feature | Discrete | Continuous |
---|---|---|
Possible values | Countable (e.g., 0, 1, 2, ...) | Uncountable (e.g., any real number) |
Described by | PMF (Probability Mass Function) | PDF (Probability Density Function) |
Probability of exact value | \( > 0 \) | \( = 0 \) |
Probability computed by | Summing probabilities | Integrating density over an interval |
1. Normal Distribution:
The normal distribution, also known as the Gaussian distribution, is characterized by its:
- Symmetric, bell-shaped curve
- Centered at the mean (μ)
- Spread determined by the standard deviation (σ)
In physics, it corresponds to the Lagrangian (high-energy physics) or effective free energy (statistical mechanics) of a non-interacting field. In machine learning, it is closely related to linear regression.
It is foundational in statistics and machine learning due to the Central Limit Theorem, which states that the suitably normalized sum of many independent random variables tends toward a normal distribution.
The Central Limit Theorem is closely related to Landau effective field theory: expanding a general Lagrangian or effective free energy around its minimum and keeping the lowest-order terms leads to a Gaussian distribution.
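A small simulation makes the Central Limit Theorem concrete (a minimal sketch; the uniform distribution and the sample sizes here are arbitrary choices): averages of many uniform draws are approximately normal.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
# 10,000 experiments, each averaging 50 draws from Uniform(0, 1)
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
plt.hist(sample_means, bins=50, density=True, alpha=0.7)
plt.title('Means of 50 Uniform Draws (approximately normal by the CLT)')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.show()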
Example: Human Heights
Human heights (in a population) often follow a normal distribution with a specific mean and standard deviation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters
mu = 170 # mean height (e.g., cm)
sigma = 10 # standard deviation
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 500)
pdf = norm.pdf(x, mu, sigma)
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Normal Distribution (μ = {mu}, σ = {sigma})')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
2. Exponential Distribution:
The exponential distribution is a continuous probability distribution used to model the time between consecutive events in a Poisson process, where:
- Events occur independently
- At a constant average rate (λ) over time
It is commonly used to model waiting times, such as the time until a machine fails or the time until the next earthquake.
PDF:
\[ f(x; \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0 \]
Example: Time Between Earthquakes
If earthquakes occur on average once every 10 days, the rate is \( \lambda = \frac{1}{10} = 0.1 \) per day.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon
# Parameters
λ = 0.1 # average rate (1/mean time)
scale = 1 / λ # scale = 1 / λ
x = np.linspace(0, 80, 500)
pdf = expon.pdf(x, scale=scale)
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Exponential Distribution (λ = {λ}, mean = {scale})')
plt.xlabel('Time Until Next Earthquake (days)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
3. Gamma Distribution:
The gamma distribution generalizes the exponential distribution. While the exponential models the waiting time until the first event in a Poisson process, the gamma distribution models the waiting time until the kᵗʰ event.
The probability density function (PDF) of the Gamma distribution is defined as:
\[ f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad \text{for } x > 0 \]
where:
- \( \alpha > 0 \) is the shape parameter: number of events
- \( \beta > 0 \) is the rate parameter (some libraries use the scale parameter \( 1/\beta \) instead),
- \( \Gamma(\alpha) \) is the gamma function
Example: Time Until 3 Earthquakes
If earthquakes occur at a rate of 1 every 10 days (λ = 0.1), the gamma distribution models the time until 3 earthquakes occur.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
# Parameters
a = 3 # shape (number of events)
λ = 0.1 # rate
scale = 1 / λ # scale = 1 / λ
x = np.linspace(0, 100, 500)
pdf = gamma.pdf(x, a, scale=scale)  # scale must be passed by keyword; the third positional argument is loc
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Gamma Distribution (alpha = {a}, λ = {λ}, mean = {a/λ})')
plt.xlabel('Time Until 3 Earthquakes (days)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
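A quick sanity check of the "waiting time until the kᵗʰ event" interpretation (a minimal sketch reusing the same rate λ = 0.1): summing three independent exponential waiting times should match the Gamma(α = 3) mean and variance.
import numpy as np
from scipy.stats import expon
# Sum of 3 independent exponential waiting times (rate 0.1 => scale 10)
sums = expon.rvs(scale=10, size=(100_000, 3), random_state=42).sum(axis=1)
print(f"Empirical mean: {sums.mean():.2f} (theory: a/λ = 30)")
print(f"Empirical variance: {sums.var():.2f} (theory: a/λ² = 300)")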
4. Beta Distribution:
The beta distribution is defined on the interval \( x \in [0, 1] \). It is commonly used to model:
- Proportions (e.g., conversion rates, success probabilities)
- Probabilities of probabilities in Bayesian statistics, where it often serves as a conjugate prior for the binomial distribution.
It is parameterized by:
- α (first shape parameter): number of successes + 1
- β (second shape parameter): number of failures + 1
\( \Gamma \) in the PDF below is the gamma function.
The probability density function (PDF) of the Beta distribution is:
\[ f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad \text{for } 0 < x < 1 \]
Example 1: Plot the distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Parameters
α, β = 2, 5 # Change these to explore different shapes
x = np.linspace(0, 1, 500)
pdf = beta.pdf(x, α, β)
# Plot
plt.figure(figsize=(6, 4))
plt.plot(x, pdf, linewidth=2)
plt.title(f'Beta Distribution (α = {α}, β = {β})')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Bayesian A/B Testing
You're testing two versions of a website:
- Version A: shown to 100 users, with 40 clicks.
- Version B: shown to 100 users, with 50 clicks.
You want to estimate the posterior distribution of the click-through rate (CTR) for each version of the website. Assume a uniform prior, Beta(1, 1): as is common in Bayesian statistics, start with the assumption that every CTR value in [0, 1] is equally likely.
Questions:
1. What are the posterior distributions of CTR for A and B?
2. What is the probability that B has a higher CTR than A?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Observed data
clicks_A, views_A = 40, 100
clicks_B, views_B = 50, 100
# Prior parameters (uniform prior)
alpha_prior, beta_prior = 1, 1
# Posterior parameters (from Bayesian inference: with a Beta prior and
# binomial data, add successes to alpha and failures to beta)
alpha_A = alpha_prior + clicks_A
beta_A = beta_prior + views_A - clicks_A
alpha_B = alpha_prior + clicks_B
beta_B = beta_prior + views_B - clicks_B
# Sample from the posteriors (sampling, covered below, means generating random data that follows a given distribution)
samples = 100_000
posterior_A = np.random.beta(alpha_A, beta_A, samples)
posterior_B = np.random.beta(alpha_B, beta_B, samples)
# Probability that B > A
prob_B_beats_A = np.mean(posterior_B > posterior_A)
print(f"Probability that version B has a higher CTR than version A: {prob_B_beats_A:.4f}")
# Plotting
x = np.linspace(0, 1, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, beta.pdf(x, alpha_A, beta_A), label='Posterior A', color='blue')
plt.plot(x, beta.pdf(x, alpha_B, beta_B), label='Posterior B', color='green')
plt.title('Posterior Distributions of Click-through Rates')
plt.xlabel('CTR')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
5. Other Distributions:
There are many more continuous distributions, and luckily most of them are implemented in the SciPy library. Check them out here: https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions
Also, so far we have looked at single-variable distributions, i.e. when x is a scalar. However, there are distributions for when \(\vec{x}\) is a vector. One well-known example is the multivariate Gaussian distribution, a critical distribution in machine learning and physics. Read more about it here:
- https://en.wikipedia.org/wiki/Multivariate_normal_distribution
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html#scipy.stats.multivariate_normal
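As a small taste (a minimal sketch with an arbitrary mean vector and covariance matrix), scipy.stats.multivariate_normal evaluates densities and draws samples much like the 1-D distributions above:
import numpy as np
from scipy.stats import multivariate_normal
# Arbitrary 2-D example: mean vector and covariance matrix
mean = [0, 0]
cov = [[1.0, 0.5],
       [0.5, 2.0]]
mvn = multivariate_normal(mean, cov)
print(f"Density at the origin: {mvn.pdf([0, 0]):.4f}")
samples = mvn.rvs(size=5, random_state=0)  # five 2-D samples
print(samples)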
2. Mean, Variance, and Covariance
For Discrete Random Variables
Suppose \( X \) is a discrete random variable with values \( x_1, x_2, \dots, x_n \), and corresponding probabilities \( P(X = x_i) = p_i \).
Mean (Expected Value):
\[ \mu = \mathbb{E}[X] = \sum_{i} x_i \cdot p_i \]
This represents the weighted average of all possible values, with each value weighted by its probability.
Variance:
\[ \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \sum_{i} (x_i - \mu)^2 \cdot p_i \]
Alternatively, using a shortcut formula:
\[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
where \( \mathbb{E}[X^2] = \sum_i x_i^2 \cdot p_i \).
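These formulas are easy to verify numerically; here is a minimal sketch for a fair six-sided die (mean 3.5, variance ≈ 2.9167):
import numpy as np
# Fair six-sided die: values and probabilities
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
mean = np.sum(x * p)                       # E[X]
var = np.sum((x - mean) ** 2 * p)          # E[(X - mu)^2]
var_shortcut = np.sum(x**2 * p) - mean**2  # E[X^2] - E[X]^2
print(f"Mean: {mean:.4f}")
print(f"Variance: {var:.4f}")
print(f"Variance (shortcut): {var_shortcut:.4f}")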
Covariance:
Covariance is the higher-dimensional version of variance. If the probability is a function of more than one variable, e.g. \( X \) and \( Y \) with joint probabilities \( p_i = P(X = x_i, Y = y_i) \), then
\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \sum_i (x_i - \mu_X)(y_i - \mu_Y) \cdot p_i \]
Alternatively, using the shortcut formula:
\[ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \cdot \mathbb{E}[Y] \]
where \( \mathbb{E}[XY] = \sum_i x_i y_i \cdot p_i \).
Note that \(\text{Cov}(Y, Y) = \text{Var}(Y)\). In other words, the covariances of a set of variables form a matrix whose diagonal entries are the variances of those variables.
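A minimal sketch with a small, made-up joint PMF over pairs \( (x_i, y_i) \) confirms that the two covariance formulas agree:
import numpy as np
# Hypothetical joint PMF over (x, y) pairs; probabilities sum to 1
x = np.array([0, 0, 1, 1])
y = np.array([0, 1, 0, 1])
p = np.array([0.3, 0.2, 0.1, 0.4])
mu_x = np.sum(x * p)
mu_y = np.sum(y * p)
cov = np.sum((x - mu_x) * (y - mu_y) * p)        # definition
cov_shortcut = np.sum(x * y * p) - mu_x * mu_y   # E[XY] - E[X]E[Y]
print(f"Cov(X, Y): {cov:.4f}")
print(f"Cov(X, Y) (shortcut): {cov_shortcut:.4f}")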
Connection to Physics:
It is closely related to the propagator in particle physics, the Green's function in mathematical physics, and the 2-point correlation function.
For Continuous Random Variables
Suppose \( X \) is a continuous random variable with probability density function (PDF) \( f(x) \).
Mean (Expected Value):
\[ \mu = \mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]
Variance:
\[ \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x) \, dx \]
Alternatively:
\[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
where \( \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 \cdot f(x) \, dx \).
Covariance:
Covariance is the higher-dimensional version of variance. If probability is defined over continuous variables, such as random variables \( X \) and \( Y \), then
\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \iint (x - \mu_X)(y - \mu_Y) \cdot f(x, y) \, dx \, dy \]
Alternatively, using the shortcut formula:
\[ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \cdot \mathbb{E}[Y] \]
where \( \mathbb{E}[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy \cdot f(x, y) \, dx \, dy \).
Here, \( f(x, y) \) is the joint probability density function of \( X \) and \( Y \).
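These integrals can also be checked numerically. A minimal sketch using scipy.integrate.quad for an exponential distribution with scale 2 (so mean 2 and variance 4):
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon
scale = 2.0  # lambda = 0.5 => mean = 2
pdf = lambda x: expon.pdf(x, scale=scale)
mean, _ = quad(lambda x: x * pdf(x), 0, np.inf)    # E[X]
ex2, _ = quad(lambda x: x**2 * pdf(x), 0, np.inf)  # E[X^2]
var = ex2 - mean**2
print(f"Mean: {mean:.4f} (theory: {scale})")
print(f"Variance: {var:.4f} (theory: {scale**2})")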
3. Sampling
Sampling means generating random data points that follow a specific probability distribution (e.g., normal, uniform, exponential). Sampling has many applications in data science. Notable use cases include simulations, bootstrapping, and testing statistical models; even cutting-edge image generators perform a type of sampling (annealed sampling).
🔧 Practical Sampling with scipy
The scipy.stats module makes this easy. Every distribution in this library has a method named rvs() which can be used to draw random samples.
Here’s a quick example using three common distributions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, uniform, expon
# Normal distribution (mean=0, std=1)
normal_samples = norm.rvs(loc=0, scale=1, size=1000)
# Uniform distribution (from 0 to 10)
uniform_samples = uniform.rvs(loc=0, scale=10, size=1000)
# Exponential distribution (lambda=1 => scale=1/lambda)
expon_samples = expon.rvs(scale=1, size=1000)
# Plotting
plt.hist(normal_samples, bins=30, alpha=0.5, label='Normal')
plt.hist(uniform_samples, bins=30, alpha=0.5, label='Uniform')
plt.hist(expon_samples, bins=30, alpha=0.5, label='Exponential')
plt.legend()
plt.title('Sampling from Distributions using SciPy')
plt.show()
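One practical note: every rvs() call also accepts a random_state argument, which makes the samples reproducible across runs, e.g.:
from scipy.stats import norm
# Fixing random_state yields the same samples on every run
samples = norm.rvs(loc=0, scale=1, size=5, random_state=42)
print(samples)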