Probability and Random Processes

Published May 2021

These are ideas from undergraduate probability that I find useful in daily life.

Probability fundamentals

Kolmogorov’s axioms:
- Probabilities are non-negative
- The probability that some outcome occurs is one: $P (Ω) = 1$
- For disjoint events $A$ and $B$ , $P (A \cup B) = P (A) + P (B)$
Conditional probability: $P (A, B) = P (A | B) P (B)$
- Law of total probability: for a partition $\{B_{i}\}$ of $Ω$ , $P (A) = \sum_{i} P (A | B_{i}) P (B_{i})$
Independence implies uncorrelated (reverse is not necessarily true)
- $X, Y$ are independent if for all $A, B$ : $P (X \in A, Y \in B) = P (X \in A) P (Y \in B)$
- Covariance: $Cov (X, Y) = E [(X - E [X]) (Y - E [Y])]$
- Correlation: $Corr (X, Y) = \frac{Cov (X, Y)}{σ (X) σ (Y)}$ ( $- 1 \leq Corr (X, Y) \leq 1$ by Cauchy-Schwarz)
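
A minimal sketch of why the reverse implication fails, assuming an illustrative pair that is not from the notes: $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^{2}$. The covariance is exactly zero, yet the joint probability does not factor.

```python
from fractions import Fraction

# Assumed example: X is uniform on {-1, 0, 1} and Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

def expect(f):
    """Exact expectation of f(X) under the uniform pmf."""
    return sum(p * f(x) for x in support)

e_x = expect(lambda x: x)
e_y = expect(lambda x: x * x)
cov = expect(lambda x: (x - e_x) * (x * x - e_y))

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint = p          # the event {X=0, Y=0} is the same as {X=0}
p_product = p * p    # P(X=0) * P(Y=0), since P(Y=0) = P(X=0)

print("Cov(X, Y) =", cov)
print("P(X=0, Y=0) =", p_joint, "but P(X=0) P(Y=0) =", p_product)
```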

Common probability tools

Expectation tricks:
- Iterated expectation/tower rule: $E [X] = E [E [X | Y]]$
- Law of total variance: $Var (Y) = E [Var (Y | X)] + Var (E [Y | X])$
- Variance sum: for pairwise uncorrelated $\{X_{i}\}$ , $Var (\sum_{i} X_{i}) = \sum_{i} Var (X_{i})$
- Tail sum: for nonnegative $X$ , $E [X] = \int_{0}^{\infty} P (X > x) d x$
- Moment generating function: $M_{X} (t) = E [e^{t X}]$
  - e.g. for Gaussian: $M_{X} (t) = exp (μ t + \frac{1}{2} σ^{2} t^{2})$
  - The $n$ -th moment of a random variable is $E [X^{n}] = M_{X}^{(n)} (0) = \frac{d^{n} M_{X}}{d t^{n}} |_{t = 0}$
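
As a quick sanity check of the tail-sum trick, here is a sketch for a fair six-sided die (an assumed example). For a nonnegative integer-valued variable the integral of $P (X > x)$ reduces to a sum over $x = 0, 1, 2, \dots$

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, so X takes values 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mean_direct = sum(k * p for k, p in pmf.items())

def tail(x):
    """P(X > x)."""
    return sum(p for k, p in pmf.items() if k > x)

# P(X > x) is constant on each interval [x, x + 1), so the integral is a sum.
mean_tail = sum(tail(x) for x in range(0, max(pmf)))

print(mean_direct, mean_tail)
```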
Probability manipulations:
- Bayes’ rule: $P (A | B) = \frac{P (B | A) P (A)}{P (B)}$
- Derived distributions: for $Y = g (X)$ with $g$ differentiable and strictly monotone, $f_{Y} (y) = \frac{f_{X} (x)}{| g^{'} (x) |}$ evaluated at $x = g^{- 1} (y)$
- Inclusion-exclusion: $P (⋃_{i = 1}^{n} A_{i}) = \sum_{k = 1}^{n} (- 1)^{k + 1} \sum_{1 \leq i_{1} < \dots < i_{k} \leq n} P (A_{i_{1}} \cap \dots \cap A_{i_{k}})$
- Order statistics: for i.i.d. continuous $X_{1}, \dots, X_{n}$ , the $i$ -th smallest has pdf $f_{X^{(i)}} (y) = n \binom{n - 1}{i - 1} F_{X} (y)^{i - 1} (1 - F_{X} (y))^{n - i} f_{X} (y)$
- Convolution: for indep. $X, Y$ , $Z = X + Y$ has pdf $f_{Z} (z) = \int_{t = - \infty}^{\infty} f_{X} (z - t) f_{Y} (t) d t$
  - The MGF of $Z$ is $M_{Z} (t) = M_{X} (t) M_{Y} (t)$ ; more generally, $Z = a X + b Y$ has $M_{Z} (t) = M_{X} (a t) M_{Y} (b t)$
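
The convolution formula has a direct discrete analogue, sketched here for the sum of two fair dice (an assumed example; the helper name `convolve` is mine). The integral becomes a sum over all pairs of values.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(pmf_x, pmf_y):
    """pmf of Z = X + Y for independent X, Y given as {value: probability} dicts."""
    pmf_z = defaultdict(Fraction)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] += px * py
    return dict(pmf_z)

die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)

for total in sorted(two_dice):
    print(total, two_dice[total])
```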
Common problem solving techniques:
- Counting: combinations, stars and bars, etc
- Indicator variables: typically used to calculate expectation / variance
- Symmetry: typically used to simplify calculations
- Probabilistic method: prove an object exists by showing that a randomly chosen one has the desired property with positive probability
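
A small sketch of the indicator-variable technique, using the classic fixed-points problem as an assumed example. Writing the number of fixed points of a uniform random permutation as $\sum_{i} 1 \{π (i) = i\}$ , linearity gives $n \cdot \frac{1}{n} = 1$ . The code checks this by brute-force enumeration for small $n$ .

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Exact E[number of fixed points] of a uniform random permutation of range(n)."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(perm) if i == v) for perm in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_fixed_points(n))
```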

Common distributions

Bernoulli: $X = 1$ with probability $p$ and $X = 0$ otherwise
- Binomial: sum of iid Bernoullis
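Geometric/exponential: number of trials (or waiting time) until the first success; the only memoryless distributions (discrete and continuous, respectively)
- Min of independent exponentials with rates $λ_{1}, \dots, λ_{n}$ is exponential with rate $\sum_{j = 1}^{n} λ_{j}$ ; $P (X_{k} = min_{i} X_{i}) = \frac{λ_{k}}{\sum_{j = 1}^{n} λ_{j}}$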
- Erlang( $k, λ$ ) is the sum of $k$ indep. exponentials with rate $λ$
- Poisson( $λ$ ): the limit of a binomial as $n \to \infty$ and $p \to 0$ and $n p \to λ$
  - Merging: for indep. $X \sim Pois (λ), Y \sim Pois (μ)$ , $X + Y \sim Pois (λ + μ)$
  - Splitting: $Poisson (λ)$ with each arrival kept indep. w.p. $p$ (dropped w.p. $1 - p$ ) is $Poisson (λ p)$
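
The Poisson limit of the binomial can be checked numerically. This is a rough sketch with an assumed rate $λ = 3$ and a few assumed values of $n$ ; it compares the two pmfs over $k = 0, \dots, 10$ and reports the largest gap.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # assumed rate for illustration

def max_gap(n):
    """Largest pmf difference over k = 0..10 between Binomial(n, lam/n) and Poisson(lam)."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(11))

for n in (10, 100, 10_000):
    print(n, max_gap(n))
```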
Gaussian: ubiquitous distribution commonly used for modeling noise
- For independent $X \sim N (μ_{1}, σ_{1}^{2}), Y \sim N (μ_{2}, σ_{2}^{2})$ , $X + Y \sim N (μ_{1} + μ_{2}, σ_{1}^{2} + σ_{2}^{2})$
- Jointly Gaussian: two random variables $X, Y$ are J.G. if every linear combination $a X + b Y$ is Gaussian
  - i.e. a J.G. vector $Y$ can be written as $Y = A Z + μ$ , where $Z$ is a vector of standard iid Gaussians
  - Uncorrelated jointly Gaussian RVs are independent
- Laplace: L1 version of Gaussian, $f_{X} (x) = \frac{1}{2 b} exp (- \frac{| x - μ |}{b})$
Exponential family: $f_{X} (x | θ) = h (x) exp (η (θ) T (x) - A (θ))$ for $h$ nonnegative
- $T (x)$ is the sufficient statistic of the distribution
- $η$ is the natural parameter
- $A (θ)$ is the log-partition function, which normalizes the density (often written $A (η)$ in terms of the natural parameter)
Gamma distribution $Γ (α, β)$ ; function $Γ (α) = \int_{0}^{\infty} x^{α - 1} e^{- x} d x$ :
- Shape parameter $α$ : $Γ (α + 1) = α Γ (α)$ ( $Γ (n + 1) = n!$ )
- Scale parameter $β$ : if $X \sim Γ (α, 1)$ , $β X \sim Γ (α, β)$
Chi-square distribution with $p$ DOF: $χ_{p}^{2} = \sum_{i = 1}^{p} Z_{i}^{2}$ for i.i.d. $Z_{i} \sim N (0, 1)$

Concentration inequalities

Union bound: $P (⋃_{i = 1}^{n} A_{i}) \leq \sum_{i = 1}^{n} P (A_{i})$
Markov’s inequality: $P (X \geq a) \leq \frac{E [X]}{a}$ for nonnegative random variable $X$ and $a > 0$
Chebyshev’s inequality: $P (| X - E [X] | \geq c) \leq \frac{Var (X)}{c^{2}}$ for $c > 0$ (Markov’s on $(X - E [X])^{2}$ )
Chernoff’s inequality: for any $t > 0$ , $P (X \geq a) = P (e^{t X} \geq e^{t a}) \leq \frac{E [e^{t X}]}{e^{t a}}$ (Markov’s on $e^{t X}$ )
Hoeffding’s inequality: for $S_{n} = \sum_{i = 1}^{n} X_{i}$ with independent $X_{i}$ and $a_{i} \leq X_{i} \leq b_{i}$ a.s., $P (S_{n} - E [S_{n}] \geq t) \leq exp (- \frac{2 t^{2}}{\sum_{i} (b_{i} - a_{i})^{2}})$
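
To get a feel for how tight these bounds are, here is a sketch comparing them against the exact tail for an assumed example: the number of heads $S$ in 100 fair coin flips, and the event $S \geq 75$ .

```python
import math

n, a = 100, 75            # assumed example: S = heads in n fair flips, bound P(S >= a)
mean, var = n / 2, n / 4

# Exact upper tail of Binomial(n, 1/2), using integer arithmetic before dividing.
exact = sum(math.comb(n, k) for k in range(a, n + 1)) / 2**n

markov = mean / a
# Chebyshev bounds the two-sided deviation, which also bounds the upper tail.
chebyshev = var / (a - mean) ** 2
# Each flip lies in [0, 1], so the sum of (b_i - a_i)^2 equals n.
hoeffding = math.exp(-2 * (a - mean) ** 2 / n)

for name, value in [("exact", exact), ("Hoeffding", hoeffding),
                    ("Chebyshev", chebyshev), ("Markov", markov)]:
    print(f"{name:10s} {value:.3e}")
```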

Convergence

Borel-Cantelli Lemma: if $\sum_{i} P (A_{i}) < \infty$ , then $P (\text{infinitely many } A_{i} \text{ occur}) = 0$
- If $\sum_{n} P (| X_{n} - X | > ϵ) < \infty$ for every $ϵ > 0$ , then $X_{n} \to X$ a.s.
Almost sure convergence: $X_{n} \to_{n \to \infty}^{a.s.} X$ if $P (lim_{n \to \infty} X_{n} = X) = 1$
- i.e. for any $ϵ > 0$ , the sequence $X_{n}$ deviates from $X$ by more than $ϵ$ only a finite number of times
- Strong Law of Large Numbers: empirical mean converges almost surely
Convergence in probability: $X_{n} \to_{n \to \infty}^{i.p.} X$ if for every $ϵ > 0$ , $lim_{n \to \infty} P (| X_{n} - X | > ϵ) = 0$
- i.e. the probability that $X_{n}$ deviates from $X$ goes to zero (but $X_{n}$ can still deviate infinitely often)
- Weak Law of Large Numbers: empirical mean converges in probability
Convergence in distribution: $X_{n} \to_{n \to \infty}^{d} X$ if $\forall x$ with $P (X = x) = 0$ , $P (X_{n} \leq x) \to_{n \to \infty}^{} P (X \leq x)$
- i.e. the distribution of $X_{n}$ approaches the distribution of $X$
- Central Limit Theorem: the standardized empirical mean of i.i.d. samples with finite variance converges in distribution to a standard normal
- Markov Chains: state distribution converges to stationary distribution
$L^{r}$ ( $r$ -th mean) convergence: ${lim}_{n \to \infty} E [| X_{n} - X |^{r}] = 0$
- Dominated convergence: if $X_{n} \to X$ a.s., $| X_{n} | < Y$ , and $E [Y] < \infty$ , then $X_{n} \to X$ in $L^{1}$
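
The Weak Law of Large Numbers can be illustrated without simulation. This sketch assumes fair coin flips and $ϵ = 0.1$ , and computes $P (| \bar{X}_{n} - \frac{1}{2} | > ϵ)$ exactly from the binomial distribution for growing $n$ .

```python
from fractions import Fraction
from math import comb

def deviation_prob(n, eps=Fraction(1, 10)):
    """Exact P(|mean of n fair coin flips - 1/2| > eps)."""
    half = Fraction(1, 2)
    # Count the outcomes whose empirical mean k/n is more than eps away from 1/2.
    bad = sum(comb(n, k) for k in range(n + 1) if abs(Fraction(k, n) - half) > eps)
    return bad / 2**n

probs = {n: deviation_prob(n) for n in (10, 100, 1000)}
for n, prob in probs.items():
    print(n, prob)
```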

Information theory

Entropy: $H (X) = - E [log [p (X)]]$
- Chain rule for entropy: $H (X, Y) = H (X) + H (Y | X)$
- Mutual information: $I (X; Y) = H (X) - H (X | Y)$
- Data processing inequality: for a Markov chain $X \to Y \to Z$ , $I (X; Y) \geq I (X; Z)$
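Huffman encoding: optimal prefix code for $X$ , with expected length $L$ satisfying $H (X) \leq L < H (X) + 1$ bits
- Source coding theorem: cannot compress $X$ to fewer than $H (X)$ bits on average
- Kraft-McMillan inequality: $\sum_{w} 2^{- l (w)} \leq 1$ for any uniquely decodable code, and any lengths satisfying it are achievable by a prefix code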
  - i.e. we don’t have to use a non-prefix code
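
A minimal sketch of Huffman's algorithm that tracks only codeword lengths, on an assumed four-symbol source with dyadic probabilities (where the expected length meets the entropy exactly). The function name `huffman_lengths` is illustrative, not from any library.

```python
import heapq
import math

def huffman_lengths(pmf):
    """Return {symbol: codeword length} for a binary Huffman code."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in pmf}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # merging pushes every symbol one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed example source
lengths = huffman_lengths(pmf)
avg_len = sum(pmf[s] * lengths[s] for s in pmf)
entropy = -sum(p * math.log2(p) for p in pmf.values())
kraft = sum(2.0 ** -l for l in lengths.values())

print(lengths, avg_len, entropy, kraft)
```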
Asymptotic equipartition property: $P (| - \frac{1}{n} log p (X_{1}, \dots, X_{n}) - H (X) | \leq ϵ) \to 1$ as $n \to \infty$
- i.e. the average $\frac{1}{n} \sum_{i = 1}^{n} log \frac{1}{p (X_{i})}$ converges to $H (X)$ for i.i.d. $X_{i}$ by the Law of Large Numbers
KL divergence: $D_{K L} (p | | q) = E_{p} [log \frac{1}{q (X)}] - E_{p} [log \frac{1}{p (X)}]$
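- i.e. the expected number of extra bits when compressing samples from $p$ with a code optimized for $q$
- Total variation distance: $T V D (p | | q) = {max}_{A} | p (A) - q (A) | = \frac{1}{2} \sum_{x} | p (x) - q (x) |$
  - Pinsker’s inequality: $T V D (p | | q) \leq \sqrt{\frac{1}{2} D_{K L} (p | | q)}$ (with $D_{K L}$ measured in nats)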
- Jensen-Shannon divergence: $J S D (p | | q) = \frac{1}{2} (D (p | | m) + D (q | | m)), m = \frac{1}{2} (p + q)$
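
A short numeric sketch of these divergences on two assumed pmfs over three symbols. It takes total variation to be half the $L^{1}$ distance (the standard definition) and measures KL in nats, which is the unit under which Pinsker's inequality holds as written.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for pmfs given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance, computed as half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]          # assumed example distributions
q = [1 / 3, 1 / 3, 1 / 3]
m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
jsd = 0.5 * (kl(p, m) + kl(q, m))

print("KL(p||q) =", kl(p, q), " KL(q||p) =", kl(q, p))   # KL is not symmetric
print("TV =", tv(p, q), " Pinsker bound =", math.sqrt(0.5 * kl(p, q)))
print("JSD =", jsd)
```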
Channel coding theorem: channel capacity $C = {max}_{p (X)} I (X; Y)$
- Binary erasure channel: bit erased with probability $p$ , has capacity $C = 1 - p$
- Binary symmetric channel: bit flipped with probability $p$ , has capacity $C = 1 - H (p)$ , where $H (p)$ is the binary entropy

Markov chains

Discrete (DTMC):

Markov chains satisfy the Markov property: $P (X_{n} | X_{n - 1}, X_{n - 2}, \dots) = P (X_{n} | X_{n - 1})$
- Common properties: recurrence (positive, null), transience, irreducibility, periodicity, reversibility
- Solving Markov chains: stationary distribution ( $π = π P$ ), first step equations, detailed balance
Big theorem, stationary distribution, balance equations:
- Detailed (a.k.a. local) balance equations hold if the Markov chain has a tree structure
- Flow-in/flow-out holds for any cut, extends detailed balance equations
- Stationary distribution exists for a class iff it is positive recurrent
  - If it exists, the stationary distribution for a communicating class is unique
- If the whole chain is irreducible and positive recurrent (e.g. finite), then there is a unique stationary distribution
- If the whole chain is also aperiodic, then the state distribution converges to the stationary dist for any initial dist
Other useful properties of Markov chains:
- For a random walk on a connected undirected graph, $π (i) = \frac{degree (i)}{2 E}$ , where $E$ is the number of edges in the graph
- The reciprocal of the stationary dist is the expected time to return to a state, starting from that state
Applications:
- MCMC: set up a MC and sample from the stationary dist as a proxy for sampling from the original dist
- Erdos-Renyi random graphs: $n$ vertices, with each edge independently picked to be in the graph w.p. $p$
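
As a check on the degree formula and on detailed balance, here is a sketch for a random walk on a small undirected graph. The four-vertex graph is an assumed example; the walk moves to a uniformly random neighbor at each step.

```python
from fractions import Fraction

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # assumed example graph
n = 4
neighbors = {i: [] for i in range(n)}
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

# Transition matrix: from i, move to each neighbor with probability 1 / degree(i).
P = [[Fraction(1, len(neighbors[i])) if j in neighbors[i] else Fraction(0)
      for j in range(n)] for i in range(n)]

# Candidate stationary distribution: degree(i) / (2 * number of edges).
pi = [Fraction(len(neighbors[i]), 2 * len(edges)) for i in range(n)]
pi_P = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Detailed balance: pi(i) P(i, j) == pi(j) P(j, i) for every pair of states.
reversible = all(pi[i] * P[i][j] == pi[j] * P[j][i]
                 for i in range(n) for j in range(n))

print(pi, pi_P, reversible)
```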

Continuous (CTMC):

CTMCs have exponentially distributed holding times between transitions, rather than discrete time steps
- Holding time is the min of exponentials
- Has rate matrix, detailed balance equations, recurrence, transience
- Stationary distribution satisfies $π Q = 0$
Jump/embedded chain: create a DTMC that models the “jumps” of a CTMC
- i.e. visitation order of the states, by considering transition probabilities as the min of exponentials
- Transition probability from $k$ to $j \neq k$ is $P (k, j) = \frac{λ_{k, j}}{\sum_{i \neq k} λ_{k, i}}$
- Crucially, no self loops, so does not take into account holding time
Uniformization: create a DTMC by expressing the rates relative to a fixed rate $λ$ , chosen at least as large as every state’s total exit rate
- Transition probability from $k$ to $j$ (for $k \neq j$ ) is $P (k, j) = \frac{λ_{k, j}}{λ}$
- Transition probability from $k$ to $k$ (self-loop) is $P (k, k) = 1 - \sum_{i = 1, i \neq k}^{n} P (k, i)$
- Also can write transition matrix $P$ in terms of rate matrix $Q$ as $P = I + \frac{1}{λ} Q$
- Has the same stationary distribution: $π P = π (\frac{1}{λ} Q + I) = π (0 + I) = π$
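
A sketch contrasting the two constructions on an assumed two-state CTMC (rates 1 and 2, uniformization rate 3). The uniformized chain keeps the CTMC's stationary distribution, while the jump chain, which ignores holding times, generally does not.

```python
from fractions import Fraction

# Assumed example: state 0 -> 1 at rate 1, state 1 -> 0 at rate 2.
Q = [[Fraction(-1), Fraction(1)],
     [Fraction(2), Fraction(-2)]]
lam = Fraction(3)   # any value >= the largest exit rate (here 2) works
n = len(Q)

def left_multiply(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(n)) for j in range(n)]

# Uniformized chain: P = I + Q / lam.
P = [[(1 if i == j else 0) + Q[i][j] / lam for j in range(n)] for i in range(n)]

# Jump chain: normalize off-diagonal rates by the exit rate, no self-loops.
jump = [[Fraction(0) if i == j else Q[i][j] / -Q[i][i] for j in range(n)]
        for i in range(n)]

pi = [Fraction(2, 3), Fraction(1, 3)]          # solves pi Q = 0
pi_jump = [Fraction(1, 2), Fraction(1, 2)]     # stationary for the jump chain

print(left_multiply(pi, Q), left_multiply(pi, P), left_multiply(pi_jump, jump))
```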
Poisson process: number of arrivals in time $t$ is Poisson( $λ t$ ) (with indep. non-overlapping intervals)
- Merging: the sum of indep. Poisson processes with rates $λ, μ$ is a new Poisson process with rate $λ + μ$
- Splitting: for a Poisson process w/ rate $λ$ where each arrival is kept indep. w.p. $p$ , the kept arrivals form a Poisson process w/ rate $p λ$
- The $k$ -th arrival time $T_{k} \sim Erlang (k, λ)$ is the sum of $k$ independent exponential interarrival times with rate $λ$
- Conditioned on $n$ arrivals by time $t$ ( $N_{t} = n$ ), the arrival times are distributed as the order statistics of $n$ i.i.d. uniforms on $[0, t]$
  - e.g. $E [T_{i + 1} - T_{i}] = \frac{t}{n + 1}$
- Random incidence paradox: from the perspective of a point, the expected interarrival time is doubled
  - A Poisson process backwards is still a Poisson process

Hypothesis testing and statistics

Neyman-Pearson: a form of frequentist hypothesis testing, where we assume no prior over parameter $X$
- Suppose there are two outcomes, either $X = 0$ (null) or $X = 1$ , the alternate hypothesis
- Since we have no prior, there is no notion of the “most likely” outcome
- Probability of False Alarm (PFA): $P (\hat{X} = 1 | X = 0)$ (“type 1” error)
- Probability of Correct Detection (PCD): $P (\hat{X} = 1 | X = 1)$ (one minus the probability of a “type 2” error)
- Goal: maximize PCD such that PFA is less than “budget” $β$
ROC curve: the maximum achievable PCD as a function of the PFA budget $β$ ; the curve is nondecreasing, so the optimal test uses the entire budget (PFA $= β$ )
- AUC is the probability a random positive sample is ranked higher than a random negative sample
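
A sketch of the Neyman-Pearson setup under an assumed model that is not from the notes: the observation is $Y \sim N (0, 1)$ when $X = 0$ and $Y \sim N (1, 1)$ when $X = 1$ . The likelihood ratio is increasing in $y$ here, so the likelihood-ratio test reduces to a threshold on $Y$ , set so that the PFA equals the budget.

```python
from statistics import NormalDist

beta = 0.05                 # PFA budget (assumed value)
mu1 = 1.0                   # assumed mean of Y under X = 1

null = NormalDist(0.0, 1.0)         # distribution of Y when X = 0
alternate = NormalDist(mu1, 1.0)    # distribution of Y when X = 1

# Declare X_hat = 1 when Y > tau; choose tau so that PFA = beta exactly.
tau = null.inv_cdf(1 - beta)
pfa = 1 - null.cdf(tau)
pcd = 1 - alternate.cdf(tau)

print(f"tau = {tau:.4f}, PFA = {pfa:.4f}, PCD = {pcd:.4f}")
```

Sweeping `beta` from 0 to 1 and plotting `pcd` against `pfa` would trace out the ROC curve for this model.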
$p$ -value: given that $X = 0$ is true, what is the probability of observing data at least as extreme as what we observed?
Sufficient statistic: $t = T (X)$ is sufficient for parameter $θ$ if $P (X | t)$ does not depend on $θ$
- Fisher–Neyman factorization theorem: $T$ is sufficient iff $p_{θ} (x) = g_{θ} (T (x)) h (x)$ for some $g_{θ}, h$
  - i.e. the product of two functions where only $g$ depends on $θ$ (through $T$ )
- Minimal sufficiency: $T$ is minimal sufficient if it is sufficient and can be computed from any other sufficient statistic
- Completeness: $T$ is complete if $E_{θ} [f (T)] = 0$ for all $θ$ implies $f (T) = 0$ a.s.
  - If $T$ is complete and sufficient, then $T$ is minimal sufficient
- $V$ is ancillary if its distribution does not depend on $θ$
  - Basu’s theorem: if $T$ is complete sufficient and $V$ ancillary, they are independent

Estimation

For an estimator $\hat{y} = f (X)$ :
- Expected error: $E [(f (X) - Y)^{2}]$
- Bias: $E [f (X) - Y]$
- Bias-variance tradeoff: $E [(f (X) - Y)^{2}] = (Bias (f))^{2} + Var (f) + σ_{Y}^{2}$
Maximum likelihood estimation: find parameters $θ$ that maximize likelihood $l (X | θ)$
- The MLE can be biased, e.g. German tank problem, variance of a Gaussian from samples
- MLE is a special case of MAP, where the prior over $θ$ is uniform
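
The bias of the variance MLE can be seen exactly on a tiny example. This sketch applies the Gaussian MLE formula (divide by $n$ , not $n - 1$ ) to samples of three fair die rolls, an assumed stand-in chosen so that every possible sample can be enumerated. The expected estimate comes out to $\frac{n - 1}{n}$ times the true variance.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)   # assumed example: fair die rolls rather than Gaussian samples
n = 3

def mle_variance(sample):
    """Variance MLE: mean squared deviation from the sample mean (divides by n)."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / len(sample)

samples = list(product(faces, repeat=n))        # all 6^3 equally likely samples
expected_mle = sum(mle_variance(s) for s in samples) / len(samples)

true_mean = Fraction(sum(faces), 6)
true_var = sum((x - true_mean) ** 2 for x in faces) / 6

print("true variance:", true_var)
print("E[MLE]:", expected_mle)
```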
Maximum a posteriori estimation: find parameters $θ$ that maximize the posterior, which is proportional to $l (X | θ) f (θ)$
- e.g. minimizing $L (w) = \sum_{i = 1}^{n} (y_{i} - w x_{i})^{2} + λ | w |$ corresponds to MAP with Gaussian noise and a Laplace prior on $w$
Minimum Mean Square Estimation (MMSE): find the best function $ϕ$ to minimize $E [(Y - ϕ (X))^{2}]$
- Consider Hilbert space where $⟨ X, Y ⟩ = E [X Y]$ , with corresponding norm $\| X \|^{2} = E [X^{2}]$
- The MMSE estimator is $ϕ (X) = E [Y | X]$ , since its error is orthogonal to every function of $X$ : $⟨ Y - E [Y | X], f (X) ⟩ = 0$ for all $f$
- Both the LLSE and MMSE are unbiased, as $E [X - E [X]] = E [E [Y | X] - E [Y]] = 0$
Linear Least Squares Estimation (LLSE): best sq. error linear estimator of $Y$ given $1, X$
- $L [Y | X] = E [Y] + \frac{cov (X, Y)}{var (X)} (X - E [X])$
- Projection of $Y$ onto both $\tilde{X}$ and $1$ , where $\tilde{X}$ is the transformed $X$ such that $⟨ \tilde{X}, 1 ⟩ = 0$
- If $X$ and $Y$ are jointly Gaussian, the LLSE is also the MMSE
- Kalman filtering: given observations $Y_{1}, \dots, Y_{n}$ , compute $E [X_{n} | Y_{1}, Y_{2}, \dots Y_{n}]$
  - Smoothing: given $Y_{1}, \dots, Y_{n}$ , estimate the past $X_{i}$ for $i < n$ : $E [X_{i} | Y_{1}, Y_{2}, \dots, Y_{n}]$
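
A sketch of the LLSE formula and its projection interpretation on an assumed four-point joint pmf. The estimation error $Y - L [Y | X]$ should be orthogonal to both $1$ and $X$ under the inner product $⟨ U, V ⟩ = E [U V]$ .

```python
from fractions import Fraction as F

# Assumed example: joint pmf of (X, Y) on four equally likely points.
pmf = {(0, 0): F(1, 4), (1, 1): F(1, 4), (2, 1): F(1, 4), (2, 4): F(1, 4)}

def E(f):
    """Exact expectation of f(X, Y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

ex, ey = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: (x - ex) * (y - ey))
var = E(lambda x, y: (x - ex) ** 2)

def llse(x):
    """L[Y | X = x] = E[Y] + cov(X, Y) / var(X) * (x - E[X])."""
    return ey + cov / var * (x - ex)

# Orthogonality: the error Y - L[Y | X] has zero inner product with 1 and with X.
err_dot_one = E(lambda x, y: y - llse(x))
err_dot_x = E(lambda x, y: (y - llse(x)) * x)

print("slope:", cov / var, "errors:", err_dot_one, err_dot_x)
```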
Rao-Blackwell: for $T$ sufficient and $L$ convex, $E [L (y, E [\hat{y} | T])] \leq E [L (y, \hat{y})]$ (strict if $L$ is strictly convex and $\hat{y}$ is not already a function of $T$ )
- i.e., a randomized estimator $\hat{y} (T)$ is worse than the non-randomized one $E [\hat{y} | T]$
Cramer-Rao: for an unbiased estimator $\hat{y}$ , $Var (\hat{y}) \geq \frac{1}{J (y)}$
- i.e. the variance of any unbiased estimator is bounded by the reciprocal of the Fisher information
- Fisher information: $J (y) = Var (\nabla_{y} log (p_{y} (x)))$ (note: commonly use $θ$ instead of $y$ )
- Efficiency: $\frac{(J (y))^{- 1}}{Var (\hat{y})}$ (if equals one for all $y$ , $\hat{y}$ is fully efficient; may not be possible)

Addendum

At Berkeley, we had a tradition of creating “study guides” before exams, which are still useful to me today when looking up old material. Back when I was a teaching assistant, I wrote a study guide for our class on Probability and Random Processes that is the base for this webpage, which you can find here.

I plan on slowly adding material to this over time as I get around to it. It’s mostly here as a convenient resource for me to look up old content.
