Probability Basics
Abstract:
Probability distributions are fundamental in data science for purposes like building machine learning models, modeling uncertainty, and guiding informed decision-making. They define how the values of a set of random variables are distributed. Broadly, these distributions are classified into discrete and continuous types. Below is an overview of key probability distributions, along with their characteristics and common applications.
Discrete Probability Distributions
Discrete distributions apply to scenarios where the set of possible outcomes is countable. Key discrete distributions include:
1. Bernoulli Distribution:
The Bernoulli distribution models a single trial with exactly two possible outcomes:
- Success with probability p
- Failure with probability 1 – p
It is the simplest type of probability distribution and serves as the foundation for more complex distributions like the binomial distribution. In machine learning, it is closely related to logistic regression.
Probability mass function:
\[ P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\} \]
Example 1: Coin Toss
A fair coin toss can be modeled using a Bernoulli distribution with p = 0.5, where:
- Success = Heads
- Failure = Tails
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
# Define the probability of success (e.g., heads)
p = 0.5
x = [0, 1] # 0 = failure, 1 = success
# Get probability mass function values
pmf = bernoulli.pmf(x, p)
# Plot
plt.figure()
plt.bar(x, pmf, tick_label=["Failure (0)", "Success (1)"])
plt.title(f'Bernoulli Distribution (p = {p})')
plt.ylabel('Probability')
plt.xlabel('Outcome')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Ad Click
A website visitor has a 30% chance of clicking on an ad. What is the probability that a randomly selected visitor clicks the ad?
This is a Bernoulli trial:
- Only one trial (a single visitor)
- Two possible outcomes: click (success = 1) or no click (failure = 0)
- Probability of success \( p = 0.3 \)
from scipy.stats import bernoulli
# Probability of success
p = 0.3
# Bernoulli distribution
click_prob = bernoulli.pmf(1, p) # P(click)
no_click_prob = bernoulli.pmf(0, p) # P(no click)
print(f"Probability the visitor clicks the ad: {click_prob:.2f}")
print(f"Probability the visitor does NOT click the ad: {no_click_prob:.2f}")
2. Binomial Distribution:
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success p.
- n = number of trials
- p = probability of success in each trial
- k = number of successes (0 to n)
Probability mass function:
\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]
Example 1: 10 Coin Flips
If you flip a fair coin 10 times (n = 10, p = 0.5), the binomial distribution models the probability of getting k heads, where k can range from 0 to 10.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Parameters
n = 10 # number of trials
p = 0.5 # probability of success
x = np.arange(0, n+1)
# Probability mass function
pmf = binom.pmf(x, n, p)
# Plot
plt.figure()
plt.bar(x, pmf, color='skyblue')
plt.title(f'Binomial Distribution (n = {n}, p = {p})')
plt.xlabel('Number of Successes (Heads)')
plt.ylabel('Probability')
plt.xticks(x)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Free Throws
A basketball player has a free-throw success rate of 80%. During a game, she takes 10 free throws. What is the probability that she makes exactly 8 of them?
This is a classic binomial probability problem where:
- Each free throw is an independent trial
- The probability of success \( p = 0.8 \)
- The number of trials \( n = 10 \)
- We are interested in exactly \( k = 8 \) successes
from scipy.stats import binom
# Parameters
n = 10 # number of trials
p = 0.8 # probability of success
k = 8 # desired number of successes
# Binomial probability
probability = binom.pmf(k, n, p)
print(f"Probability of exactly {k} successful free throws out of {n}: {probability:.4f}")
3. Poisson Distribution:
The Poisson distribution models the number of events occurring in a fixed interval of time or space, under the following assumptions:
- Events occur independently of one another
- The average rate of occurrence (λ) is constant
- Two events cannot occur at exactly the same instant
It’s useful for modeling count-based events over time, such as system failures, web traffic, or call arrivals.
Probability mass function:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots \]
Example: Call Center
Suppose a call center receives on average 4 calls per hour (λ = 4). The Poisson distribution can model the probability of receiving 0, 1, 2, ..., k calls in any given hour.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
# Parameters
mu = 4 # average number of calls per hour (scipy's poisson names the rate parameter mu)
x = np.arange(0, 15)
# Probability mass function
pmf = poisson.pmf(x, mu)
# Plot
plt.figure()
plt.bar(x, pmf, color='salmon')
plt.title(f'Poisson Distribution (λ = {mu})')
plt.xlabel('Number of Calls per Hour')
plt.ylabel('Probability')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
4. Discrete Uniform Distribution:
The discrete uniform distribution assigns equal probability to all outcomes in a finite set. It is used when there is no prior reason to favor one outcome over another: each outcome is equally likely.
For a set of \( n \) possible outcomes:
\[ P(X = x_i) = \frac{1}{n} \quad \text{for all } x_i \text{ in the set} \]
Example: Fair Die Roll
Rolling a fair six-sided die is a classic example. Each face (1 through 6) has a probability of \( \frac{1}{6} \).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
# Parameters
low, high = 1, 7 # randint is [low, high), so high = 7 for die faces 1 to 6
x = np.arange(low, high)
# Probability mass function
pmf = randint.pmf(x, low, high)
# Plot
plt.figure()
plt.bar(x, pmf, color='lightgreen')
plt.title('Discrete Uniform Distribution (Fair 6-Sided Die)')
plt.xlabel('Die Face')
plt.ylabel('Probability')
plt.xticks(x)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Continuous Probability Distributions
A continuous probability distribution describes the probabilities of a continuous random variable, which can take any real value in a given range.
Key Features:
- It is described by a Probability Density Function (PDF) \( f(x) \).
- Probabilities are calculated as areas under the curve of the PDF: \[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]
- The probability of a single exact value is zero: \[ P(X = x) = 0 \]
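As a quick numerical check (a minimal sketch using the standard normal from scipy.stats), interval probabilities come from the CDF, while the density at a point is not a probability:
import numpy as np
from scipy.stats import norm
# P(-1 <= X <= 1) for a standard normal: the area under the PDF,
# computed as CDF(1) - CDF(-1)
interval_prob = norm.cdf(1) - norm.cdf(-1)
print(f"P(-1 <= X <= 1) = {interval_prob:.4f}")  # ~0.6827
# P(X = 0) is exactly zero; norm.pdf(0) is a density value, not a probability
print(f"Density at x = 0: {norm.pdf(0):.4f}")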
🔁 Key Differences Between Discrete and Continuous Distributions
Feature | Discrete | Continuous |
---|---|---|
Possible values | Countable (e.g., 0, 1, 2, ...) | Uncountable (e.g., any real number) |
Described by | PMF (Probability Mass Function) | PDF (Probability Density Function) |
Probability of exact value | \( > 0 \) | \( = 0 \) |
Probability computed by | Summing probabilities | Integrating density over an interval |
1. Normal Distribution:
The normal distribution, also known as the Gaussian distribution, is characterized by its:
- Symmetric, bell-shaped curve
- Centered at the mean (μ)
- Spread determined by the standard deviation (σ)
In physics, it corresponds to the Lagrangian (high-energy physics) or effective free energy (statistical mechanics) of a non-interacting field. In machine learning, it is closely related to linear regression.
It is foundational in statistics and machine learning due to the Central Limit Theorem, which states that the suitably normalized sum of many independent random variables tends toward a normal distribution.
The Central Limit Theorem is closely related to Landau effective field theory: expanding a general Lagrangian or effective free energy around its minimum and keeping the lowest-order terms leads to a Gaussian distribution.
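A small simulation makes the Central Limit Theorem concrete (a minimal sketch; the uniform distribution and the sample sizes here are arbitrary choices): averages of many uniform draws are approximately normal.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
# 10,000 experiments, each averaging 50 draws from Uniform(0, 1)
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
plt.hist(sample_means, bins=50, density=True, alpha=0.7)
plt.title('Means of 50 Uniform Draws (approximately normal by the CLT)')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.show()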
Example: Human Heights
Human heights (in a population) often follow a normal distribution with a specific mean and standard deviation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters
mu = 170 # mean height (e.g., cm)
sigma = 10 # standard deviation
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 500)
pdf = norm.pdf(x, mu, sigma)
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Normal Distribution (μ = {mu}, σ = {sigma})')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
2. Exponential Distribution:
The exponential distribution is a continuous probability distribution used to model the time between consecutive events in a Poisson process, where:
- Events occur independently
- At a constant average rate (λ) over time
It is commonly used to model waiting times, such as the time until a machine fails or the time until the next earthquake.
PDF:
\[ f(x; \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0 \]
Example: Time Between Earthquakes
If earthquakes occur on average once every 10 days, the rate is \( \lambda = \frac{1}{10} = 0.1 \) per day.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon
# Parameters
λ = 0.1 # average rate (1/mean time)
scale = 1 / λ # scale = 1 / λ
x = np.linspace(0, 80, 500)
pdf = expon.pdf(x, scale=scale)
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Exponential Distribution (λ = {λ}, mean = {scale})')
plt.xlabel('Time Until Next Earthquake (days)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
3. Gamma Distribution:
The gamma distribution generalizes the exponential distribution. While the exponential models the waiting time until the first event in a Poisson process, the gamma distribution models the waiting time until the kᵗʰ event.
The probability density function (PDF) of the Gamma distribution is defined as:
\[ f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad \text{for } x > 0 \]
where:
- \( \alpha > 0 \) is the shape parameter: number of events
- \( \beta > 0 \) is the rate parameter (some libraries use the scale parameter \( 1/\beta \) instead),
- \( \Gamma(\alpha) \) is the gamma function
Example: Time Until 3 Earthquakes
If earthquakes occur at a rate of 1 every 10 days (λ = 0.1), the gamma distribution models the time until 3 earthquakes occur.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
# Parameters
a = 3 # shape (number of events)
λ = 0.1 # rate
scale = 1 / λ # scale = 1 / λ
x = np.linspace(0, 100, 500)
pdf = gamma.pdf(x, a, scale=scale)  # scale must be passed by keyword; the third positional argument is loc
# Plot
plt.figure()
plt.plot(x, pdf, linewidth=2)
plt.title(f'Gamma Distribution (alpha = {a}, λ = {λ}, mean = {a/λ})')
plt.xlabel('Time Until 3 Earthquakes (days)')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
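A quick sanity check of the "waiting time until the kᵗʰ event" interpretation (a minimal sketch reusing the same rate λ = 0.1): summing three independent exponential waiting times should match the Gamma(α = 3) mean and variance.
import numpy as np
from scipy.stats import expon
# Sum of 3 independent exponential waiting times (rate 0.1 => scale 10)
sums = expon.rvs(scale=10, size=(100_000, 3), random_state=42).sum(axis=1)
print(f"Empirical mean: {sums.mean():.2f} (theory: a/λ = 30)")
print(f"Empirical variance: {sums.var():.2f} (theory: a/λ² = 300)")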
4. Beta Distribution:
The beta distribution is defined on the interval \( x \in [0, 1] \). It is commonly used to model:
- Proportions (e.g., conversion rates, success probabilities)
- Probabilities of probabilities in Bayesian statistics, where it often serves as a conjugate prior for the binomial distribution.
It is parameterized by:
- α (first shape parameter): number of successes + 1
- β (second shape parameter): number of failures + 1
\( \Gamma \) in the PDF below is the gamma function.
The probability density function (PDF) of the Beta distribution is:
\[ f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad \text{for } 0 < x < 1 \]
Example 1: Plot the distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Parameters
α, β = 2, 5 # Change these to explore different shapes
x = np.linspace(0, 1, 500)
pdf = beta.pdf(x, α, β)
# Plot
plt.figure(figsize=(6, 4))
plt.plot(x, pdf, linewidth=2)
plt.title(f'Beta Distribution (α = {α}, β = {β})')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Example 2: Bayesian A/B Testing
You're testing two versions of a website:
- Version A: shown to 100 users, with 40 clicks.
- Version B: shown to 100 users, with 50 clicks.
You want to estimate the posterior distribution of the click-through rate (CTR) for each version of the website. Assume a uniform prior, Beta(1, 1): as is common in Bayesian statistics, start with the assumption that every CTR value in [0, 1] is equally likely.
Questions:
1. What are the posterior distributions of CTR for A and B?
2. What is the probability that B has a higher CTR than A?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Observed data
clicks_A, views_A = 40, 100
clicks_B, views_B = 50, 100
# Prior parameters (uniform prior)
alpha_prior, beta_prior = 1, 1
# Posterior parameters (from Bayesian inference: with a Beta prior and
# binomial data, add successes to alpha and failures to beta)
alpha_A = alpha_prior + clicks_A
beta_A = beta_prior + views_A - clicks_A
alpha_B = alpha_prior + clicks_B
beta_B = beta_prior + views_B - clicks_B
# Sample from the posteriors (sampling, covered below, means generating random data that follows a given distribution)
samples = 100_000
posterior_A = np.random.beta(alpha_A, beta_A, samples)
posterior_B = np.random.beta(alpha_B, beta_B, samples)
# Probability that B > A
prob_B_beats_A = np.mean(posterior_B > posterior_A)
print(f"Probability that version B has a higher CTR than version A: {prob_B_beats_A:.4f}")
# Plotting
x = np.linspace(0, 1, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, beta.pdf(x, alpha_A, beta_A), label='Posterior A', color='blue')
plt.plot(x, beta.pdf(x, alpha_B, beta_B), label='Posterior B', color='green')
plt.title('Posterior Distributions of Click-through Rates')
plt.xlabel('CTR')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
5. Other Distributions:
There are many more continuous distributions, and luckily most of them are implemented in the SciPy library. Check them out here: https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions
Also, so far we have looked at single-variable distributions, i.e. when x is a scalar. However, there are distributions for when \(\vec{x}\) is a vector. One well-known example is the multivariate Gaussian distribution, a critical distribution in machine learning and physics. Read more about it here:
- https://en.wikipedia.org/wiki/Multivariate_normal_distribution
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html#scipy.stats.multivariate_normal
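As a small taste (a minimal sketch with an arbitrary mean vector and covariance matrix), scipy.stats.multivariate_normal evaluates densities and draws samples much like the 1-D distributions above:
import numpy as np
from scipy.stats import multivariate_normal
# Arbitrary 2-D example: mean vector and covariance matrix
mean = [0, 0]
cov = [[1.0, 0.5],
       [0.5, 2.0]]
mvn = multivariate_normal(mean, cov)
print(f"Density at the origin: {mvn.pdf([0, 0]):.4f}")
samples = mvn.rvs(size=5, random_state=0)  # five 2-D samples
print(samples)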
2. Mean, Variance, and Covariance
For Discrete Random Variables
Suppose \( X \) is a discrete random variable with values \( x_1, x_2, \dots, x_n \), and corresponding probabilities \( P(X = x_i) = p_i \).
Mean (Expected Value):
\[ \mu = \mathbb{E}[X] = \sum_{i} x_i \cdot p_i \]
This represents the weighted average of all possible values, with each value weighted by its probability.
Variance:
\[ \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \sum_{i} (x_i - \mu)^2 \cdot p_i \]
Alternatively, using a shortcut formula:
\[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
where \( \mathbb{E}[X^2] = \sum_i x_i^2 \cdot p_i \).
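These formulas are easy to verify numerically; here is a minimal sketch for a fair six-sided die (mean 3.5, variance ≈ 2.9167):
import numpy as np
# Fair six-sided die: values and probabilities
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
mean = np.sum(x * p)                       # E[X]
var = np.sum((x - mean) ** 2 * p)          # E[(X - mu)^2]
var_shortcut = np.sum(x**2 * p) - mean**2  # E[X^2] - E[X]^2
print(f"Mean: {mean:.4f}")
print(f"Variance: {var:.4f}")
print(f"Variance (shortcut): {var_shortcut:.4f}")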
Covariance:
Covariance is the higher-dimensional version of variance. If the probability is a function of more than one variable, e.g. \( X \) and \( Y \) with joint probabilities \( p_i = P(X = x_i, Y = y_i) \), then
\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \sum_i (x_i - \mu_X)(y_i - \mu_Y) \cdot p_i \]
Alternatively, using the shortcut formula:
\[ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \cdot \mathbb{E}[Y] \]
where \( \mathbb{E}[XY] = \sum_i x_i y_i \cdot p_i \).
Note that \(\text{Cov}(Y, Y) = \text{Var}(Y)\). In other words, the covariances of a set of variables form a matrix whose diagonal entries are the variances of those variables.
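A minimal sketch with a small, made-up joint PMF over pairs \( (x_i, y_i) \) confirms that the two covariance formulas agree:
import numpy as np
# Hypothetical joint PMF over (x, y) pairs; probabilities sum to 1
x = np.array([0, 0, 1, 1])
y = np.array([0, 1, 0, 1])
p = np.array([0.3, 0.2, 0.1, 0.4])
mu_x = np.sum(x * p)
mu_y = np.sum(y * p)
cov = np.sum((x - mu_x) * (y - mu_y) * p)        # definition
cov_shortcut = np.sum(x * y * p) - mu_x * mu_y   # E[XY] - E[X]E[Y]
print(f"Cov(X, Y): {cov:.4f}")
print(f"Cov(X, Y) (shortcut): {cov_shortcut:.4f}")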
Connection to Physics:
It is closely related to the propagator in particle physics, the Green's function in mathematical physics, and the 2-point correlation function.
For Continuous Random Variables
Suppose \( X \) is a continuous random variable with probability density function (PDF) \( f(x) \).
Mean (Expected Value):
\[ \mu = \mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]
Variance:
\[ \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x) \, dx \]
Alternatively:
\[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
where \( \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 \cdot f(x) \, dx \).
Covariance:
Covariance is the higher-dimensional version of variance. If probability is defined over continuous variables, such as random variables \( X \) and \( Y \), then
\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \iint (x - \mu_X)(y - \mu_Y) \cdot f(x, y) \, dx \, dy \]
Alternatively, using the shortcut formula:
\[ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \cdot \mathbb{E}[Y] \]
where \( \mathbb{E}[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy \cdot f(x, y) \, dx \, dy \).
Here, \( f(x, y) \) is the joint probability density function of \( X \) and \( Y \).
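These integrals can also be checked numerically. A minimal sketch using scipy.integrate.quad for an exponential distribution with scale 2 (so mean 2 and variance 4):
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon
scale = 2.0  # lambda = 0.5 => mean = 2
pdf = lambda x: expon.pdf(x, scale=scale)
mean, _ = quad(lambda x: x * pdf(x), 0, np.inf)    # E[X]
ex2, _ = quad(lambda x: x**2 * pdf(x), 0, np.inf)  # E[X^2]
var = ex2 - mean**2
print(f"Mean: {mean:.4f} (theory: {scale})")
print(f"Variance: {var:.4f} (theory: {scale**2})")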
3. Sampling
Sampling means generating random data points that follow a specific probability distribution (e.g., normal, uniform, exponential). Sampling has many applications in data science. Notable use cases include simulations, bootstrapping, and testing statistical models; even cutting-edge image generators perform a type of sampling (annealed sampling).
🔧 Practical Sampling with scipy
The scipy.stats module makes this easy. Every distribution in this library has a method named rvs() which can be used to draw random samples.
Here’s a quick example using three common distributions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, uniform, expon
# Normal distribution (mean=0, std=1)
normal_samples = norm.rvs(loc=0, scale=1, size=1000)
# Uniform distribution (from 0 to 10)
uniform_samples = uniform.rvs(loc=0, scale=10, size=1000)
# Exponential distribution (lambda=1 => scale=1/lambda)
expon_samples = expon.rvs(scale=1, size=1000)
# Plotting
plt.hist(normal_samples, bins=30, alpha=0.5, label='Normal')
plt.hist(uniform_samples, bins=30, alpha=0.5, label='Uniform')
plt.hist(expon_samples, bins=30, alpha=0.5, label='Exponential')
plt.legend()
plt.title('Sampling from Distributions using SciPy')
plt.show()
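One practical note: every rvs() call also accepts a random_state argument, which makes the samples reproducible across runs, e.g.:
from scipy.stats import norm
# Fixing random_state yields the same samples on every run
samples = norm.rvs(loc=0, scale=1, size=5, random_state=42)
print(samples)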