Probability and Random Processes

These are a bunch of ideas from undergraduate probability that I sometimes find useful in daily life.

Probability fundamentals

  1. Kolmogorov’s axioms:
    • Probabilities are non-negative
    • The probability of the entire sample space is one: $P(\Omega) = 1$
    • For disjoint events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$ (and, more generally, countable additivity for any countable collection of disjoint events)
  2. Conditional probability: $P(A, B) = P(A | B) P(B)$
    • Law of total probability: for a partition $\{B_i\}$ of $\Omega$, $P(A) = \sum_i P(A | B_i) P(B_i)$
  3. Independence implies uncorrelated (the converse is not necessarily true; see the sketch after this list)
    • $X, Y$ are independent if for all $A, B$: $P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$
    • Covariance: $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$
    • Correlation: $\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma(X) \sigma(Y)}$ ($-1 \leq \text{Corr}(X, Y) \leq 1$ by Cauchy-Schwarz)
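
As a quick numerical check of the last point, here is a minimal numpy sketch (the choice $Y = X^2$ is my own example): $X \sim \mathcal{N}(0, 1)$ and $Y = X^2$ are uncorrelated but clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2  # Cov(X, Y) = E[X^3] = 0 by symmetry, yet Y is a function of X

print(np.corrcoef(x, y)[0, 1])    # ~0: uncorrelated
print(np.mean(y))                 # ~1 = E[Y]
print(np.mean(y[np.abs(x) > 2]))  # >> 1: conditioning on X changes Y
```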

Common probability tools

  1. Expectation tricks:
    • Iterated expectation/tower rule: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X | Y]]$
    • Law of total variance: $\text{Var}(Y) = \mathbb{E}[\text{Var}(Y|X)] + \text{Var}(\mathbb{E}[Y|X])$
    • Variance sum: for pairwise uncorrelated $\{X_i\}$, $\text{Var}(\sum_i X_i) = \sum_i \text{Var}(X_i)$
    • Tail sum: for nonnegative $X$, $\mathbb{E}[X] = \int_{x=0}^\infty P(X > x) \, dx$ (checked numerically in the sketch after this list)
    • Moment generating function: $M_X(t) = \mathbb{E}[e^{tX}]$
      • e.g. for Gaussian: $M_X(t) = \text{exp}(\mu t + \frac{1}{2} \sigma^2 t^2)$
      • The $n$-th moment of a random variable is $\mathbb{E}[X^n] = M^{(n)}_X(0) = \frac{d^n M_X}{d t^n} \big|_{t=0}$
  2. Probability manipulations:
    • Bayes’ rule: $P(A | B) = \frac{P(B | A) P(A)}{P(B)}$
    • Derived distributions: for $Y = g(X)$, $f_Y(y) = \sum_{x: g(x) = y} \frac{f_X(x)}{|g'(x)|}$ (a single term when $g$ is monotonic)
    • Inclusion-exclusion: $P(\bigcup_{i=1}^n A_i) = \sum_{k=1}^n (-1)^{k+1} \sum_{1 \leq i_1 < \dots < i_k \leq n} P(A_{i_1} \cap \dots \cap A_{i_k})$
    • Order statistics: for $n$ i.i.d. continuous $X_i$, the $i$-th smallest $X^{(i)}$ has density $f_{X^{(i)}}(y) = n \binom{n-1}{i-1} F_X(y)^{i-1} (1-F_X(y))^{n-i} f_{X}(y)$
    • Convolution: for indep. $X, Y$, $Z = X + Y$ has pdf $f_Z(z) = \int_{t = -\infty}^{\infty} f_X(z-t) f_Y(t) dt$
      • The MGF of $Z$ is $M_Z(t) = M_X(t) M_Y(t)$ (more generally, $aX + bY$ has MGF $M_X(at) M_Y(bt)$)
  3. Common problem solving techniques:
    • Counting: combinations, stars and bars, etc.
    • Indicator variables: typically used to calculate expectation / variance
    • Symmetry: typically used to simplify calculations
    • Probabilistic method: prove existence by showing that a random choice succeeds with positive probability
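
As referenced above, here is a minimal Monte Carlo check of the tail-sum formula; the setup ($X \sim \text{Exponential}(1)$, truncating the integral at 20) is my own choice.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(scale=1.0, size=200_000)  # E[X] = 1

# Estimate P(X > x) on a grid and integrate with the trapezoid rule.
xs = np.linspace(0.0, 20.0, 201)
tail = np.array([np.mean(samples > x) for x in xs])
dx = xs[1] - xs[0]
integral = np.sum((tail[1:] + tail[:-1]) / 2) * dx

print(integral, samples.mean())  # both ~= E[X] = 1
```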

Common distributions

  1. Bernoulli: $X=1$ with probability $p$ and $X=0$ otherwise
    • Binomial: sum of iid Bernoullis
  2. Geometric/exponential: trials until the first success (geometric) and its continuous analog (exponential); each is the unique memoryless distribution (discrete and continuous, respectively)
    • Min of exponentials is exponential with rate $\sum_{j=1}^n \lambda_j$; $P(X_k = \min_i X_i) = \frac{\lambda_k}{\sum_{j=1}^n \lambda_j}$ (simulated in the sketch after this list)
    • Erlang($k, \lambda$) is the sum of $k$ indep. exponentials with rate $\lambda$
    • Poisson($\lambda$): the limit of a binomial as $n \to \infty$ and $p \to 0$ and $np \to \lambda$
      • Merging: for indep. $X \sim \text{Pois}(\lambda), Y \sim \text{Pois}(\mu)$, $X+Y \sim \text{Pois}(\lambda + \mu)$
      • Splitting: $\text{Poisson}(\lambda)$ with arrivals dropped indep. w.p. $p$ is $\text{Poisson}(\lambda p)$
  3. Gaussian: ubiquitous distribution commonly used for modeling noise
    • For independent $X \sim \mathcal{N}(\mu_1, \sigma_1^2), Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
    • Jointly Gaussian: two random variables $X, Y$ are J.G. if the vector $(X, Y)$ is Gaussian
      • i.e. a J.G. vector $Y$ can be written as $Y = A Z + \mu$, where $Z$ is a vector of standard iid Gaussians
      • Uncorrelated jointly Gaussian RVs are independent
    • Laplace: L1 version of Gaussian, $f_X(x) = \frac{1}{2b} \text{exp}(-\frac{|x - \mu|}{b})$
  4. Exponential family: $f_X(x | \theta) = h(x) \text{exp}(\eta(\theta) T(x) - A(\theta))$ for $h$ nonnegative
    • $T(x)$ is the sufficient statistic of the distribution
    • $\eta$ is the natural parameter
    • $A$ is the log-partition function, which normalizes the distribution (written $A(\eta)$ in the natural parametrization)
  5. Gamma distribution $\Gamma(\alpha, \beta)$; function $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x} dx$:
    • Shape parameter $\alpha$: $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$ ($\Gamma(n+1) = n!$)
    • Scale parameter $\beta$: if $X \sim \Gamma(\alpha, 1)$, $\beta X \sim \Gamma(\alpha, \beta)$
  6. Chi-square distribution with $p$ DOF: $\chi_p^2 = \sum_{i=1}^p Z_i^2$ for i.i.d. $Z_i \sim \mathcal{N}(0, 1)$
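
As a sanity check on the min-of-exponentials facts above, a short simulation sketch (the rates $\lambda = (1, 2, 3)$ are arbitrary choices; note numpy parametrizes exponentials by scale $= 1/\lambda$).

```python
import numpy as np

rng = np.random.default_rng(2)
rates = np.array([1.0, 2.0, 3.0])
n = 200_000
x = rng.exponential(scale=1.0 / rates, size=(n, len(rates)))

mins = x.min(axis=1)
print(mins.mean(), 1.0 / rates.sum())     # both ~= 1/6: min is Exp(sum of rates)
print(np.bincount(x.argmin(axis=1)) / n)  # ~= [1/6, 2/6, 3/6]
print(rates / rates.sum())
```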

Concentration inequalities

  1. Union bound: $P(\bigcup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$
  2. Markov’s inequality: $P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$ for nonnegative random variable $X$ and $a > 0$
  3. Chebyshev’s inequality: $P(|X - \mathbb{E}[X]| \geq c) \leq \frac{\text{Var}(X)}{c^2}$ (Markov’s on $|X - \mathbb{E}[X]|$)
  4. Chernoff’s inequality: $P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}}$ for any $t > 0$ (Markov’s on $e^{tX}$)
  5. Hoeffding’s inequality: for independent $X_i$ with $a_i \leq X_i \leq b_i$ a.s. and $S_n = \sum_i X_i$, $P(S_n - \mathbb{E}[S_n] \geq t) \leq \text{exp}(-\frac{2t^2}{\sum_i (b_i - a_i)^2})$ (compared to the other bounds in the sketch after this list)
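
A quick comparison sketch of the bounds above against the true tail, for $S_n$ a sum of $n = 100$ fair coin flips (my own toy setup, so $a_i = 0$, $b_i = 1$).

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 100, 10.0
s = rng.binomial(1, 0.5, size=(200_000, n)).sum(axis=1)
mean, var = n * 0.5, n * 0.25

print(np.mean(s - mean >= t))   # empirical tail, ~0.028
print(mean / (mean + t))        # Markov on P(S_n >= mean + t), ~0.83
print(var / t ** 2)             # Chebyshev (two-sided bound), 0.25
print(np.exp(-2 * t ** 2 / n))  # Hoeffding, ~0.135
```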

Convergence

  1. Borel-Cantelli Lemma: if $\sum_i P(A_i) < \infty$, then $P(\text{infinitely many } A_i \text{ occur}) = 0$
    • If $\sum_i P(|X_n - X| > \epsilon) < \infty$, then $X_n \to X$ a.s.
  2. Almost sure convergence: $X_n \xrightarrow[n \to \infty]{\text{a.s.}} X$ if $P(\lim_{n \to \infty} X_n = X) = 1$
    • i.e. the sequence $X_n$ deviates from $X$ (by more than any fixed $\epsilon$) only a finite number of times
    • Strong Law of Large Numbers: empirical mean converges almost surely
  3. Convergence in probability: $X_n \xrightarrow[n \to \infty]{\text{i.p.}} X$ if $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$
    • i.e. the probability that $X_n$ deviates from $X$ goes to zero (but the sequence can still deviate infinitely often)
    • Weak Law of Large Numbers: empirical mean converges in probability
  4. Convergence in distribution: $X_n \xrightarrow[n \to \infty]{\text{d}} X$ if $\forall x$ with $P(X = x) = 0$, $P(X_n \leq x) \xrightarrow[n \to \infty]{} P(X \leq x)$
    • i.e. the distribution of $X_n$ approaches the distribution of $X$
    • Central Limit Theorem: the standardized sample mean $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$ converges in distribution to a standard normal (see the simulation after this list)
    • Markov Chains: state distribution converges to stationary distribution
  5. $L^r$ ($r$-th mean) convergence: $\lim_{n \to \infty} \mathbb{E}[|X_n - X|^r] = 0$
    • Dominated convergence: if $X_n \to X$ a.s., $|X_n| < Y$, and $\mathbb{E}[Y] < \infty$, then $X_n \to X$ in $L^1$
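
The CLT simulation referenced above, as a minimal sketch: standardized means of i.i.d. Uniform(0, 1) samples (variance $\frac{1}{12}$) behave like a standard normal; the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
u = rng.uniform(size=(50_000, n))
z = (u.mean(axis=1) - 0.5) / np.sqrt(1.0 / 12.0 / n)  # standardize the mean

print(z.std())            # ~1
print(np.mean(z > 1.96))  # ~0.025 = P(Z > 1.96) for Z ~ N(0, 1)
```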

Information theory

  1. Entropy: $H(X) = -\mathbb{E}[\text{log } p(X)]$
    • Chain rule for entropy: $H(X,Y) = H(X) + H(Y|X)$
    • Mutual information: $I(X; Y) = H(X) - H(X | Y)$
    • Data processing inequality: for a Markov chain $X \to Y \to Z$, $I(X; Y) \geq I(X; Z)$
  2. Huffman encoding: an optimal prefix code; its expected length $L$ satisfies $H(X) \leq L < H(X) + 1$ bits per symbol
    • Source coding theorem: no code can compress $X$ to fewer than $H(X)$ bits per symbol on average
    • Kraft-McMillan inequality: $\sum_w 2^{-l(w)} \leq 1$ for any uniquely decodable code, and any lengths satisfying it are achievable by a prefix code
      • i.e. we lose nothing by restricting ourselves to prefix codes
  3. Asymptotic equipartition property: $P(|-\frac{1}{n} \text{log } p(X_1, \dots, X_n) - H(X)| \leq \epsilon) \to 1$ as $n \to \infty$
    • i.e. the sample average of $\text{log } \frac{1}{p(X_i)}$ converges to $H(X)$ by the Law of Large Numbers
  4. KL divergence: $D_{KL}(p || q) = \mathbb{E}_p[\text{log} \frac{1}{q(X)}] - \mathbb{E}_p[\text{log} \frac{1}{p(X)}]$
    • i.e. the number of extra bits from improper compression of $p$
    • Total variation distance: $TVD(p || q) = \text{max}_A |p(A) - q(A)| = \frac{1}{2} \sum_x |p(x) - q(x)|$
      • Pinsker’s inequality: $TVD(p || q) \leq \sqrt{\frac{1}{2}D_{KL}(p || q)}$ (checked numerically in the sketch after this list)
    • Jensen-Shannon divergence: $JSD(p || q) = \frac{1}{2}(D(p || m) + D(q || m)), m = \frac{1}{2}(p + q)$
  5. Channel coding theorem: reliable communication is possible at any rate below the channel capacity $C = \text{max}_{p(X)} I(X; Y)$ and impossible above it
    • Binary erasure channel: bit erased with probability $p$, has capacity $C = 1 - p$
    • Binary symmetric channel: bit flipped with probability $p$, has capacity $C = 1 - H(p)$ for binary entropy $H$
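
A small sketch of these quantities, including the numerical Pinsker check referenced above; the distributions $p$ and $q$ are arbitrary examples, and entropies are computed in bits (Pinsker is stated for nats, hence the $\ln 2$ conversion).

```python
import numpy as np

def entropy(p):
    """H(p) in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """D_KL(p || q) in bits (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

tvd = 0.5 * np.sum(np.abs(p - q))
print(entropy(p))                                # 1.5 bits
print(tvd, np.sqrt(0.5 * kl(p, q) * np.log(2)))  # TVD <= Pinsker bound
```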

Markov chains

Discrete (DTMC):

  1. Markov chains satisfy the Markov property: $P(X_n | X_{n-1}, X_{n-2}, …) = P(X_n | X_{n-1})$
    • Common properties: recurrence (positive, null), transience, irreducibility, periodicity, reversibility
    • Solving Markov chains: stationary distribution ($\pi = \pi P$; computed in the sketch after this list), first step equations, detailed balance
  2. Big theorem, stationary distribution, balance equations:
    • Detailed (a.k.a. local) balance equations hold if the Markov chain has a tree structure
    • Flow-in equals flow-out across any cut, which extends the detailed balance equations
    • Stationary distribution exists for a class iff it is positive recurrent
      • If it exists, the stationary distribution for a communicating class is unique
    • If the whole chain is irreducible, then there is a unique stationary distribution
    • If the whole chain is also aperiodic, then the distribution of $X_n$ converges to the stationary dist for any initial dist
  3. Other useful properties of Markov chains:
    • For a random walk on an undirected graph, $\pi(i) = \frac{\text{degree}(i)}{2 E}$, where $E$ is the number of edges in the graph
    • The reciprocal of the stationary dist is the expected time to return to a state, starting from that state
  4. Applications:
    • MCMC: set up a MC and sample from the stationary dist as a proxy for sampling from the original dist
    • Erdos-Renyi random graphs: $n$ vertices, with each edge independently picked to be in the graph w.p. $p$
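
A sketch of solving $\pi = \pi P$ for a toy three-state chain (the matrix is my own example), once via the left eigenvector and once via power iteration.

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])  # rows sum to 1; irreducible and aperiodic

# Left eigenvector of P for eigenvalue 1, normalized to a distribution.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)

# Power iteration: the distribution converges from any starting point.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    mu = mu @ P
print(mu)  # ~= pi
```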

Continuous (CTMC):

  1. CTMCs make transitions after exponentially distributed holding times, rather than at discrete time steps
    • The holding time in a state is the min of the outgoing exponential clocks, hence exponential with the total outgoing rate
    • Has rate matrix, detailed balance equations, recurrence, transience
    • Stationary distribution satisfies $\pi Q = 0$
  2. Jump/embedded chain: create a DTMC that models the “jumps” of a CTMC
    • i.e. the visitation order of the states, with transition probabilities determined by which exponential clock fires first
    • Transition probability from $k$ to $j$ is $P(k, j) = \frac{\lambda_{k,j}}{\sum_{i \neq k} \lambda_{k,i}}$
    • Crucially, no self loops, so does not take into account holding time
  3. Uniformization: create a DTMC by expressing the rates relative to a fixed uniform rate $\lambda \geq \text{max}_k \sum_{j \neq k} \lambda_{k,j}$
    • Transition probability from $k$ to $j$ (for $k \neq j$) is $P(k, j) = \frac{\lambda_{k, j}}{\lambda}$
    • Transition probability from $k$ to $k$ (self-loop) is $P(k, k) = 1 - \sum_{i=1, i \neq k}^n P(k, i)$
    • Also can write transition matrix $P$ in terms of rate matrix $Q$ as $P = I + \frac{1}{\lambda} Q$
    • Has the same stationary distribution: $\pi P = \pi (I + \frac{1}{\lambda} Q) = \pi + \frac{1}{\lambda} \pi Q = \pi$, using $\pi Q = 0$
  4. Poisson process: number of arrivals in time $t$ is Poisson($\lambda t$) (with indep. non-overlapping intervals)
    • Merging: the sum of indep. Poisson processes with rates $\lambda, \mu$ is a new Poisson process with rate $\lambda + \mu$
    • Splitting: for Poisson process w/ rate $\lambda$ and drop arrivals w.p. $p$, we have a Poisson process w/ rate $p \lambda$
    • $T_k \sim \text{Erlang}(k, \lambda)$ is the distribution of sum of $k$ independent exponentials with rate $\lambda$
    • Conditioned on $n$ at time $t$ ($N_t = n$), the arrivals are distributed uniformly
      • e.g. $\mathbb{E}[T_{i+1} - T_i] = \frac{t}{n+1}$
    • Random incidence paradox: the interarrival interval containing a fixed inspection time has expected length $\frac{2}{\lambda}$, double the mean interarrival time (simulated in the sketch after this list)
      • A Poisson process backwards is still a Poisson process
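
The random incidence simulation referenced above, as a minimal sketch for a rate-1 Poisson process (the inspection time and path lengths are arbitrary choices): the interval straddling the inspection time averages $2/\lambda$, not $1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, t_inspect = 1.0, 250.0
gaps = rng.exponential(scale=1.0 / lam, size=(10_000, 600))
arrivals = np.cumsum(gaps, axis=1)

# For each sample path, pick the interarrival interval containing t_inspect.
idx = np.argmax(arrivals > t_inspect, axis=1)
lengths = gaps[np.arange(len(idx)), idx]

print(gaps.mean(), lengths.mean())  # ~= 1/lam and ~= 2/lam
```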

Hypothesis testing and statistics

  1. Neyman-Pearson: a form of frequentist hypothesis testing, where we assume no prior over parameter $X$
    • Suppose there are two outcomes, either $X=0$ (null) or $X=1$, the alternate hypothesis
    • Since we have no prior, there is no notion of the “most likely” outcome
    • Probability of False Alarm (PFA): $P(\hat{X}=1 | X=0)$ (“type 1” error)
    • Probability of Correct Detection (PCD): $P(\hat{X}=1 | X=1)$ (the power, i.e. one minus the “type 2” error)
    • Goal: maximize PCD such that PFA is less than “budget” $\beta$
  2. ROC curve: plots the achievable PCD against PFA as the decision threshold varies; Neyman-Pearson maximizes PCD subject to PFA $\leq \beta$ (traced out in the sketch after this list)
    • AUC is the probability a random positive sample is ranked higher than a random negative sample
  3. $p$-value: given that $X = 0$ is true, the probability of observing data at least as extreme as what we actually saw
  4. Sufficient statistic: $t = T(X)$ is sufficient for parameter $\theta$ if $P(X | t)$ does not depend on $\theta$
    • Fisher–Neyman factorization theorem: $T$ is sufficient iff $p_\theta(x) = g_\theta(T(x)) h(x)$ for some $g_\theta, h$
      • i.e. the product of two functions where only $g$ depends on $\theta$ (through $T$)
    • Minimal sufficiency: $T$ is minimal sufficient if it can be computed from any other sufficient statistic
    • Completeness: $T$ is complete if $\mathbb{E}_\theta[f(T)] = 0$ for all $\theta$ implies $f(T) = 0$ a.s.
      • If $T$ is complete and sufficient, then $T$ is minimal sufficient
    • $V$ is ancillary if its distribution does not depend on $\theta$
      • Basu’s theorem: if $T$ is complete sufficient and $V$ ancillary, they are independent
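
The ROC sketch referenced above: testing $\mathcal{N}(0,1)$ (null) against $\mathcal{N}(1,1)$ (alternate) by thresholding the observation (the likelihood ratio is monotone in $x$ here, so this is the Neyman-Pearson test); the parameters are my own toy choices.

```python
import numpy as np

rng = np.random.default_rng(6)
x0 = rng.normal(0.0, 1.0, 100_000)  # observations under X = 0
x1 = rng.normal(1.0, 1.0, 100_000)  # observations under X = 1

thresholds = np.linspace(-4, 5, 200)
pfa = np.array([np.mean(x0 > t) for t in thresholds])
pcd = np.array([np.mean(x1 > t) for t in thresholds])

# AUC two ways: trapezoid area under the ROC, and the ranking probability.
auc_roc = np.sum(-np.diff(pfa) * (pcd[1:] + pcd[:-1]) / 2)
auc_rank = np.mean(x1[:2000, None] > x0[None, :2000])
print(auc_roc, auc_rank)  # both ~0.76
```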

Estimation

  1. For an estimator $\hat{y} = f(X)$:
    • Expected error: $\mathbb{E}[(f(X) - Y)^2]$
    • Bias: $\mathbb{E}[f(X) - Y]$
    • Bias-variance tradeoff: $\mathbb{E}[(f(X) - Y)^2] = (\text{Bias}(f))^2 + \text{Var}(f) + \sigma_Y^2$, where $\sigma_Y^2$ is the irreducible noise
  2. Maximum likelihood estimation: find parameters $\theta$ that maximize likelihood $l(X | \theta)$
    • The MLE estimator can be biased, e.g. the German tank problem, or the variance of a Gaussian from samples
    • MLE is a special case of MAP, where the prior over $\theta$ is uniform
  3. Maximum a posteriori estimation: find parameters $\theta$ that maximize the posterior, proportional to $l(X | \theta) f(\theta)$
    • e.g. $L(x) = \sum_{i=1}^n (y_i - wx_i)^2 + \lambda |w|$ corresponds to a Laplace prior on $w$
  4. Minimum Mean Square Estimation (MMSE): find the best function $\phi$ to minimize $\mathbb{E}[(Y - \phi(X))^2]$
    • Consider Hilbert space where $\langle X, Y \rangle = \mathbb{E}[XY]$, with corresponding norm $| X |^2 = \mathbb{E}[X^2]$
    • The MMSE estimator is $\mathbb{E}[Y | X]$ ($\langle Y - \phi(X), f(X) \rangle = 0$ for all other $f$)
    • Both the LLSE and MMSE are unbiased, as $\mathbb{E}[\mathbb{E}[Y|X]] = \mathbb{E}[Y]$ by the tower rule (and likewise for $L[Y|X]$)
  5. Linear Least Squares Estimation (LLSE): best sq. error linear estimator of $Y$ given $1, X$
    • $L[Y|X] = \mathbb{E}[Y] + \frac{\text{cov}(X,Y)}{\text{var}(X)} (X - \mathbb{E}[X])$ (estimated from samples in the sketch after this list)
    • Projection of $Y$ onto both $\tilde{X}$ and $1$, where $\tilde{X}$ is the transformed $X$ such that $\langle \tilde{X}, 1 \rangle = 0$
    • If $X$ and $Y$ are jointly Gaussian, the LLSE coincides with the MMSE
    • Kalman filtering: given observations $Y_1, …, Y_n$, compute $\mathbb{E}[X_n | Y_1, Y_2, … Y_n]$
      • Smoothing: given $Y_1, …, Y_n$, estimate the past $X_i$ for $i<n$: $\mathbb{E}[X_i | Y_1, Y_2, …, Y_n]$
  6. Rao-Blackwell: for $T$ sufficient and $L$ convex, $\mathbb{E}[L(y, \mathbb{E}[\hat{y}|T])] \leq \mathbb{E}[L(y, \hat{y})]$ (strict if $L$ is strictly convex)
    • i.e., conditioning an estimator on a sufficient statistic can only improve it; a randomized estimator is no better than the non-randomized $\mathbb{E}[\hat{y}|T]$
  7. Cramer-Rao: for an unbiased estimator $\hat{y}$, $\text{Var}(\hat{y}) \geq \frac{1}{J(y)}$
    • i.e. the variance of any unbiased estimator is bounded by the reciprocal of the Fisher information
    • Fisher information: $J(y) = \text{Var}(\nabla_y \text{log}(p_y(x)))$ (note: commonly use $\theta$ instead of $y$)
    • Efficiency: $\frac{(J(y))^{-1}}{\text{Var}(\hat{y})}$ (if equals one for all $y$, $\hat{y}$ is fully efficient; may not be possible)
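
The LLSE sketch referenced above: estimating $L[Y|X]$ from samples of a toy linear model (the model $Y = 2X + \text{noise}$ is my own example) and checking the orthogonality of the residual.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 100_000)
y = 2.0 * x + rng.normal(0.0, 0.5, 100_000)

slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # cov(X, Y) / var(X)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # ~2.0 and ~0.0

# Orthogonality: the residual Y - L[Y|X] is uncorrelated with X (and with 1).
residual = y - (intercept + slope * x)
print(np.mean(residual * x), residual.mean())  # both ~0
```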

Addendum

At Berkeley, we had a tradition of creating “study guides” before exams, which are still useful to me today when looking up old material. Back when I was a teaching assistant, I wrote a study guide for our class on Probability and Random Processes that is the base for this webpage, which you can find here.

I plan on slowly adding material to this over time as I get around to it. It’s mostly here as a convenient resource for me to look up old content.
